Extracting Metadata from Public Procurement with JavaScript

This guide explains how to use Sosse to crawl procurement notices published on the European public procurement portal, TED (Tenders Electronic Daily, available at ted.europa.eu). TED is the official platform for publishing public procurement notices from across Europe. We’ll show how to extract metadata from these offer pages using JavaScript, covering setup, crawling, and export.

Screenshot of public offers search on ted.europa.eu

Setting up a Crawl Policy for Metadata Extraction

Crawl policies define how Sosse interacts with targeted web pages. In this scenario, we want to extract structured data (such as offer title, deadline, country, etc.) from public offer listings using JavaScript. For more background, see the ⚑ Crawl Policies documentation.

  • Navigate to Crawl Policies: Go to the ⚑ Crawl Policies page from the admin menu.

  • Create a New Policy:

    • In the ⚑ Crawl tab, set a URL regex that targets the public offer detail pages on TED:

    ^https://ted.europa.eu/en/notice/-/detail/[0-9]*-.*
    
    • In the 🌍 Browser tab, set Default browse mode to Firefox or Chromium. A real browser is needed because the extraction script has to execute JavaScript on the page.

    • In the same tab, in the Script field, provide a script that will run in the browser context of each page. Any data returned by the script is used to update the data of the crawled URL. Content-specific metadata can be stored in the metadata field (see Crawl Policy Script).

      You can write your own script or use AI tools such as GitHub Copilot or ChatGPT to generate a script. To get started, visit an example offer page, such as https://ted.europa.eu/en/notice/-/detail/123456-2024 and inspect the elements you want to extract:

      return {
        metadata: {
          title: document.querySelector('h1')?.innerText || '',
          ...
        }
      };
      
    • Under the 🕑 Recurrence tab, set Crawl frequency to Once to avoid re-crawling the same offers.
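
Before saving the policy, it can help to check which URLs the regex from the Crawl tab would select. The snippet below is only a sanity check run outside Sosse (for example in Node.js or a browser console); Sosse applies the pattern itself. The second URL is a made-up non-detail address used purely for contrast.

```javascript
// Pattern copied verbatim from the crawl policy's URL regex.
// Note: the unescaped dots match any character, which is looser than a
// literal "." but still selects the intended pages.
const noticePattern = new RegExp('^https://ted.europa.eu/en/notice/-/detail/[0-9]*-.*');

const candidates = [
  'https://ted.europa.eu/en/notice/-/detail/123456-2024', // example URL from this guide
  'https://ted.europa.eu/en/some-other-page',             // hypothetical non-detail URL
];

// Keep only the URLs the policy would apply to.
const matches = candidates.filter((url) => noticePattern.test(url));
console.log(matches);
```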

../_images/guide_data_extract_crawl_policy.png
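
As a rough sketch of how the extraction script could be extended, the helper below reads several fields and tolerates missing elements. Only the 'h1' selector comes from this guide; the other selectors and the `deadline` and `country` field names are hypothetical placeholders that must be replaced after inspecting a real notice page. It also assumes the Script field accepts helper functions before the final `return`.

```javascript
// Sketch only: every selector except 'h1' is a hypothetical placeholder.
// Inspect a real TED notice page and substitute the actual selectors.

// Return the trimmed text of the first element matching `selector`,
// or an empty string when nothing matches.
function textOf(doc, selector) {
  const el = doc.querySelector(selector);
  const text = el ? (el.innerText || el.textContent || '') : '';
  return text.trim();
}

// Build the object the crawl policy script is expected to return.
function extractMetadata(doc) {
  return {
    metadata: {
      title: textOf(doc, 'h1'),
      deadline: textOf(doc, '.hypothetical-deadline'), // placeholder selector
      country: textOf(doc, '.hypothetical-country'),   // placeholder selector
    },
  };
}

// In the Script field, the last line would then be:
//   return extractMetadata(document);
```

Passing the document in as a parameter keeps the helper testable outside the browser with a small stub object.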

Searching for Public Offers and Queuing URLs

Screenshot of public offers search on ted.europa.eu
  • Queue Search Result URLs in Sosse:

    • Copy the URLs of the offer detail pages you wish to crawl.

    • Go to the Crawl a new URL page in Sosse and paste the URLs.

    • Click Confirm to queue the crawl jobs.
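
Copying many URLs by hand is tedious, so one possible shortcut is to gather them from the search results page in the browser's developer console. The sketch below assumes the result links are ordinary `<a href>` elements, which has not been verified against the current TED markup; `collectNoticeUrls` is a hypothetical helper, not part of Sosse.

```javascript
// Same pattern as the crawl policy's URL regex.
const NOTICE_PATTERN = new RegExp('^https://ted.europa.eu/en/notice/-/detail/[0-9]*-.*');

// Keep only notice detail URLs, dropping duplicates while preserving order.
function collectNoticeUrls(hrefs) {
  return [...new Set(hrefs)].filter((href) => NOTICE_PATTERN.test(href));
}

// Browser console usage (assumes result links are plain <a href> elements):
//   const hrefs = Array.from(document.querySelectorAll('a[href]'), (a) => a.href);
//   console.log(collectNoticeUrls(hrefs).join('\n'));
```

The printed list can then be pasted into the Crawl a new URL page.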

Note

By default, this crawls the offers and regularly checks for new ones, as defined in the ⚑ Crawl Policy. See 🕑 Recurrence.

Reviewing Extracted Results

After the crawl jobs complete, you can review the extracted metadata in several ways:

  • From the Document Page: Go to the 🔤 Documents page to view the extracted data in the 📊 Metadata section.

../_images/guide_data_extract_document_metadata.png
  • CSV Export: On the Searches page, use the CSV Export feature to download the results.

../_images/guide_data_extract_csv_export.png
  • REST API: Access the extracted results programmatically via the REST API.

Additional Options

By combining Sosse’s crawling and JavaScript extraction features, you can efficiently monitor TED’s public offer portal, extract structured data, and automate notifications.

To stay updated about new or changed offers, you can: