Excluding/Including Tags

When scraping with Sentinel Scout, you can clean the output by removing or keeping specific HTML tags. This makes the data easier to use, lighter to store, and more consistent for downstream AI/ML pipelines.

Note: This section is only an overview. For detailed request/response schemas, refer to the API Docs (Advanced) section when implementing.

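  • tagsToStripOff → A list of tags to remove from the output (e.g., <script>, <style>, <iframe>). In the example request below, tags are passed as bare names without angle brackets.

  • Default behavior → All tags are kept unless they are listed in tagsToStripOff.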

  • Use cases:

    • Remove scripts and ads for cleaner content.

    • Keep <b> or <i> tags (by leaving them out of tagsToStripOff) if formatting is required for downstream tasks.
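The bullets above write tags with angle brackets, while the example request in this section passes bare names such as "script". A minimal Python sketch of cleaning up a tag list before sending it might look like the following. The helper name normalize_tags is hypothetical, and the assumption that the API expects bare lowercase names is inferred only from the example request, not confirmed by the API reference.

```python
def normalize_tags(tags):
    """Return bare, lowercase, de-duplicated tag names in first-seen order.

    Hypothetical helper; assumes the API wants names like "script",
    as shown in the example request, rather than "<script>".
    """
    result = []
    for tag in tags:
        # Drop surrounding whitespace, angle brackets, and slashes, then lowercase.
        name = tag.strip().strip("<>/ ").lower()
        if name and name not in result:
            result.append(name)
    return result


tags_to_strip = normalize_tags(["<script>", "style", "IFRAME", "script"])
print(tags_to_strip)  # ['script', 'style', 'iframe']
```

Tags you want to keep, such as b or i, are simply left out of the list.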

Example Request

curl -X POST "https://api.scout.sentinel.co/api/v1/probe" \
  -H "Authorization: Bearer <API_KEY>" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://en.wikipedia.org/wiki/Artificial_intelligence",
    "countryCode": "US",
    "tagsToStripOff": [
      "script",
      "style",
      "iframe",
      "noscript"
    ],
    "fallBackRouting": true,
    "antiBotScrape": true,
    "outputFileExtension": "EXTENSION_HTML"
  }'
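
For readers calling the API from code rather than the shell, the same request could be assembled in Python roughly as follows. This is a sketch using only the standard library: the endpoint, headers, and field names are copied from the curl example above, while the function name build_probe_request is hypothetical. The request is only constructed here; sending it is left as a commented-out line.

```python
import json
import urllib.request

# Endpoint taken from the curl example above.
API_URL = "https://api.scout.sentinel.co/api/v1/probe"


def build_probe_request(api_key, target_url, tags_to_strip):
    """Build (but do not send) a POST request mirroring the curl example.

    Hypothetical helper; field values other than url and tagsToStripOff
    are fixed to those shown in the example request.
    """
    body = {
        "url": target_url,
        "countryCode": "US",
        "tagsToStripOff": list(tags_to_strip),
        "fallBackRouting": True,
        "antiBotScrape": True,
        "outputFileExtension": "EXTENSION_HTML",
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_probe_request(
    "<API_KEY>",
    "https://en.wikipedia.org/wiki/Artificial_intelligence",
    ["script", "style", "iframe", "noscript"],
)
# To send it, replace <API_KEY> with a real key and run:
# with urllib.request.urlopen(request) as resp:
#     print(resp.read().decode("utf-8"))
```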

Example Response

{
  "taskId": "b45d8f23-c918-46ff-9c88-541d7f2a6e01",
  "status": "SUBMITTED",
  "message": "Scraping job submitted successfully. Use the taskId to check its status."
}
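
The response only confirms that the job was queued, so client code typically needs to pull out the taskId for later status checks. A minimal sketch of that step is shown below. The function name parse_submission is hypothetical, and treating any status other than "SUBMITTED" as an error is an assumption, since "SUBMITTED" is the only status value shown in this section.

```python
import json


def parse_submission(body):
    """Return the taskId from a submission response body (a JSON string).

    Hypothetical helper. Assumes "SUBMITTED" is the success status,
    as in the example response; other statuses are not documented here.
    """
    data = json.loads(body)
    status = data.get("status")
    if status != "SUBMITTED":
        raise RuntimeError(f"Unexpected status {status!r}: {data.get('message', '')}")
    task_id = data.get("taskId")
    if not task_id:
        raise ValueError("Response does not contain a taskId")
    return task_id


example_body = """{
  "taskId": "b45d8f23-c918-46ff-9c88-541d7f2a6e01",
  "status": "SUBMITTED",
  "message": "Scraping job submitted successfully. Use the taskId to check its status."
}"""
print(parse_submission(example_body))  # b45d8f23-c918-46ff-9c88-541d7f2a6e01
```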

Note: For full parameter options and supported tags, refer to the API Docs (Advanced) section.
