Data Formats

Data Formats: HTML and JSON

Sentinel Scout supports multiple output formats so developers and AI agents can choose the structure best suited to their workflows: HTML and JSON are available now, and Markdown is planned. Set the format with the outputFileExtension parameter when submitting a scraping job.

Note: For full parameter options and supported tags, refer to the API Docs (Advanced) section.
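
As an illustration of how outputFileExtension fits into a job submission, here is a minimal Python sketch that builds the request without sending it. The endpoint, headers, and field names are taken from the curl examples on this page; the helper name build_probe_request, the SUPPORTED_FORMATS set, and the "US" default are assumptions made for this example, not part of the API.

```python
import json
import urllib.request

# Endpoint as it appears in the curl examples on this page.
API_URL = "https://api.scout.sentinel.co/api/v1/probe/sync"

# Formats this page describes as available; Markdown is listed as coming soon.
SUPPORTED_FORMATS = {"EXTENSION_HTML", "EXTENSION_JSON"}


def build_probe_request(url, output_format, api_key, country_code="US"):
    """Build (but do not send) a POST request for a scraping job.

    Hypothetical helper: the name and the client-side validation are
    assumptions for illustration only.
    """
    if output_format not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported output format: {output_format}")
    payload = {
        "url": url,
        "countryCode": country_code,
        "outputFileExtension": output_format,
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the returned object to urllib.request.urlopen would submit the job. Validating the format locally is a design choice that catches typos before a request is spent.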


1. HTML Format

  • Option: "outputFileExtension": "EXTENSION_HTML"

  • Description: Returns the cleaned HTML of the target page, with unwanted tags (defined in tagsToStripOff) removed.

  • Use cases:

    • Preserving the layout and structure of the original page.

    • Further parsing with custom HTML processing tools.

    • Building archives of web content with visual fidelity intact.

Example Request (HTML):

curl -X POST "https://api.scout.sentinel.co/api/v1/probe/sync" \
  -H "Authorization: Bearer <API_KEY>" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://en.wikipedia.org/wiki/Machine_learning",
    "countryCode": "US",
    "outputFileExtension": "EXTENSION_HTML"
  }'

Example Response (HTML Snippet):

<!DOCTYPE html>
<html>
<head>
  <title>Machine learning - Wikipedia</title>
</head>
<body>
  <h1>Machine learning</h1>
  <p>Machine learning (ML) is a field of artificial intelligence ...</p>
</body>
</html>
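
For the "further parsing" use case, the returned HTML can be handed to any HTML processing tool. The sketch below uses only Python's standard-library html.parser to pull out the page title and headings from the snippet above. The HeadingExtractor class is a hypothetical helper written for this example, and it assumes the response body is the HTML document itself.

```python
from html.parser import HTMLParser


class HeadingExtractor(HTMLParser):
    """Collect the <title> text and the text of every h1-h6 heading."""

    HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.title = ""
        self.headings = []
        self._current = None  # tag whose text is currently being collected
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" or tag in self.HEADING_TAGS:
            self._current = tag
            self._buffer = []

    def handle_data(self, data):
        if self._current is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self._current:
            return
        text = "".join(self._buffer).strip()
        if tag == "title":
            self.title = text
        else:
            self.headings.append(text)
        self._current = None


# The example response snippet shown above.
html_text = """<!DOCTYPE html>
<html>
<head>
  <title>Machine learning - Wikipedia</title>
</head>
<body>
  <h1>Machine learning</h1>
  <p>Machine learning (ML) is a field of artificial intelligence ...</p>
</body>
</html>"""

parser = HeadingExtractor()
parser.feed(html_text)
print(parser.title)     # Machine learning - Wikipedia
print(parser.headings)  # ['Machine learning']
```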

2. JSON Format

  • Option: "outputFileExtension": "EXTENSION_JSON"

  • Description: Parses the scraped content into structured JSON when possible, identifying key-value pairs, tables, and text blocks.

  • Use cases:

    • Feeding data directly into ML pipelines.

    • Programmatic analysis (e.g., sentiment analysis, keyword extraction).

    • Building structured datasets for research.

Example Request (JSON):

curl -X POST "https://api.scout.sentinel.co/api/v1/probe/sync" \
  -H "Authorization: Bearer <API_KEY>" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://en.wikipedia.org/wiki/Machine_learning",
    "countryCode": "US",
    "outputFileExtension": "EXTENSION_JSON"
  }'

Example Response (JSON Snippet):

{
  "title": "Machine learning - Wikipedia",
  "headings": [
    "Definition",
    "History",
    "Applications"
  ],
  "content": [
    {
      "section": "Introduction",
      "text": "Machine learning (ML) is a field of artificial intelligence ..."
    }
  ]
}
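
A response shaped like the snippet above can be consumed with Python's standard json module. The following is a minimal sketch that assumes the fields shown here (title, headings, content); since content is parsed into structured JSON only "when possible", the hypothetical iter_sections helper treats every field as optional.

```python
import json

# The example response snippet shown above.
response_text = """
{
  "title": "Machine learning - Wikipedia",
  "headings": ["Definition", "History", "Applications"],
  "content": [
    {
      "section": "Introduction",
      "text": "Machine learning (ML) is a field of artificial intelligence ..."
    }
  ]
}
"""


def iter_sections(doc):
    """Yield (section, text) pairs, tolerating missing fields."""
    for block in doc.get("content", []):
        yield block.get("section", ""), block.get("text", "")


doc = json.loads(response_text)
print(doc.get("title", ""))
for section, text in iter_sections(doc):
    print(f"{section}: {text}")
```

Using .get with defaults keeps a downstream pipeline from failing on pages where some fields could not be extracted.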

Coming Soon: Markdown Format

  • Option: "outputFileExtension": "EXTENSION_MARKDOWN"

  • Description: Will output scraped data as Markdown, a lightweight, readable format that is simple to integrate into content systems.


Note: Always choose the format based on your downstream workflow. For archiving and visual fidelity, use HTML. For structured AI/ML pipelines, use JSON. For lightweight content workflows, Markdown support is on the roadmap.
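
The guidance in this note can be captured as a small lookup. This is only a sketch: the workflow labels ("archive", "ml_pipeline") and the choose_output_format helper are hypothetical names chosen for this example, and Markdown is left out because it is not yet available.

```python
# Hypothetical workflow labels mapped to the format values named on this page.
FORMAT_BY_WORKFLOW = {
    "archive": "EXTENSION_HTML",      # archiving and visual fidelity
    "ml_pipeline": "EXTENSION_JSON",  # structured AI/ML pipelines
}


def choose_output_format(workflow):
    """Return the outputFileExtension value for a workflow label."""
    try:
        return FORMAT_BY_WORKFLOW[workflow]
    except KeyError:
        raise ValueError(f"no available format for workflow: {workflow!r}") from None
```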
