Guidelines for Ethical Use
When using a web crawler or scraper, it’s important to be responsible and ethical. Here are some quick tips to keep in mind:
Get Permission First
Before crawling a site, make sure you’re allowed to access it. Although your crawler may have the ability to ignore robots.txt or modify the User-Agent, always respect the site owner’s preferences:
Read the site’s terms of service to see if scraping is allowed.
If you’re unsure, consider reaching out to the site owner for permission.
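One way to honor a site's stated preferences in code is to check its robots.txt rules before requesting a page. The following is a minimal sketch using Python's standard `urllib.robotparser`; the crawler name `example-crawler`, the `example.com` URLs, and the sample rules are all made up for illustration, and a robots.txt check does not replace reading the terms of service.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical crawler name; use one that honestly identifies your project.
USER_AGENT = "example-crawler"


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse the text of a robots.txt file that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser: RobotFileParser, url: str, user_agent: str = USER_AGENT) -> bool:
    """Return True if the parsed rules allow user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)


# Made-up robots.txt content, used here so the sketch runs without network access.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

parser = build_parser(SAMPLE_ROBOTS)
print(is_allowed(parser, "https://example.com/articles/1"))
print(is_allowed(parser, "https://example.com/private/data"))
print(parser.crawl_delay(USER_AGENT))
```

In a real crawl you would likely call `set_url()` and `read()` on the parser to load the site's actual robots.txt instead of passing sample text.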
Crawl Responsibly & Respect the Environment
Crawling can impact both website performance and the environment. Here’s how to do it responsibly:
Avoid Overloading Servers: Don’t make too many requests at once or crawl the same pages repeatedly.
Use Data Dumps: If available, use downloadable data dumps (e.g., Kiwix) instead of crawling, which helps reduce server load and saves resources.
Consider Environmental Impact: Crawling consumes energy. Keep your crawls efficient: collect only the data you need, and avoid unnecessary large downloads such as media files.
Use APIs When Available: If the website provides an API, prefer using it instead of crawling, as APIs are optimized for data access and reduce server load.
Prefer Generating Scripts with AI: When possible, use AI to generate scripts for structured data extraction rather than parsing unstructured pages directly, which can be less efficient and more error-prone.
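Several of the tips above (not crawling the same page twice, skipping media files, and spacing out requests) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `plan_crawl` and `crawl_politely`, the extension list, and the one-second default delay are hypothetical choices, and `fetch` stands in for whatever download function your crawler actually uses.

```python
import time
from pathlib import PurePosixPath
from typing import Callable, Iterable
from urllib.parse import urlsplit

# Illustrative set of media extensions to skip; adjust it to your needs.
MEDIA_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".mp4", ".mp3", ".zip"}


def plan_crawl(urls: Iterable[str]) -> list[str]:
    """Drop duplicate URLs and likely media files, preserving input order."""
    seen: set[str] = set()
    planned: list[str] = []
    for url in urls:
        suffix = PurePosixPath(urlsplit(url).path).suffix.lower()
        if url in seen or suffix in MEDIA_EXTENSIONS:
            continue
        seen.add(url)
        planned.append(url)
    return planned


def crawl_politely(
    urls: Iterable[str],
    fetch: Callable[[str], object],
    delay_seconds: float = 1.0,
) -> dict[str, object]:
    """Fetch each planned URL once, pausing between consecutive requests."""
    results: dict[str, object] = {}
    for index, url in enumerate(plan_crawl(urls)):
        if index > 0:
            time.sleep(delay_seconds)  # wait before every request after the first
        results[url] = fetch(url)
    return results
```

If the site's robots.txt declares a crawl delay, passing that value as `delay_seconds` is a reasonable starting point.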
Respect the Web
Ethical scraping is all about respect:
Be transparent and let site owners know if you’re crawling their content.
Avoid scraping personal or sensitive information unless explicitly allowed.
Follow copyright laws and properly attribute sources.
Use scraping tools with the right intent: to learn, build, and contribute, not to exploit or deceive.
For more information, see "Is Web Scraping Legal?"