Web crawlers are used for price monitoring, news aggregation, competitor analysis, search engine indexing, and other tasks that require structured data collection from websites. This guide walks through how to build a web crawler from scratch, starting with project planning and technology choices, and ending with environment setup and data storage. It provides a foundation you can later extend to more complex and large-scale projects.
A web crawler is a program that automatically visits web pages and collects information from them. It works by sending HTTP requests to a site, retrieving the HTML of each page, and then processing that HTML to extract the required data. After that, it follows internal links and repeats the process until it reaches predefined limits or stop conditions. This process is not the same as web scraping. For a detailed comparison, see Web scraping vs web crawling.
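To make that loop concrete, here is a minimal sketch of the fetch-parse-follow cycle. It uses the requests and BeautifulSoup libraries introduced later in this guide; the start URL (the same demo site used in the examples below) and the ten-page limit are arbitrary choices for illustration.
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse

start_url = "https://quotes.toscrape.com/"  # demo site, same as in the examples below
to_visit = [start_url]                      # queue of pages to fetch
visited = set()                             # pages already processed
max_pages = 10                              # stop condition: arbitrary page limit

while to_visit and len(visited) < max_pages:
    url = to_visit.pop(0)
    if url in visited:
        continue
    html = requests.get(url, timeout=5).text       # 1. send an HTTP request
    visited.add(url)
    soup = BeautifulSoup(html, "html.parser")      # 2. process the HTML
    for a in soup.find_all("a", href=True):        # 3. follow internal links
        link = urljoin(url, a["href"])
        if urlparse(link).netloc == urlparse(start_url).netloc:
            to_visit.append(link)

print(f"Visited {len(visited)} pages")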
Such tools are widely used for price monitoring, news aggregation, competitor analysis, and search engine indexing.
In these scenarios, building your own web crawler is often the better approach: you can tune the program to your exact needs, control request frequency, and define exactly what data to collect and how.
Before you start coding, define the core parameters of your project to avoid common issues and ensure stable operation.
A well-planned tool works reliably, uses resources efficiently, and provides high-quality results.
You can build a web crawler in multiple programming languages, including Python, Java, and PHP. Python stands out for its simple syntax and rich ecosystem of libraries for HTTP requests and HTML parsing (such as requests, BeautifulSoup, lxml). Java is a solid choice for large-scale and enterprise projects. PHP is more common in web development and less convenient for standalone crawlers.
For a first project, we'll look at how to build a web crawler in Python, which is usually the optimal choice because it lets you implement and test basic functionality quickly.
Start by installing Python from the official website. Then install the core libraries you’ll use: requests for sending HTTP requests, BeautifulSoup for parsing HTML, and lxml as the parser backend used in the examples below:
pip install requests beautifulsoup4 lxml
It’s also worth organizing your project structure from the beginning: separate files for main logic, configuration, and utilities. This makes future maintenance and scaling much easier.
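For example, a simple layout might look like this (the file names here are only illustrative):
crawler/
    main.py      # main logic: requests, parsing, link handling
    config.py    # configuration: start URLs, headers, delays, proxy settings
    utils.py     # utilities: saving results, retries, logging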
A basic script can consist of three main parts: sending a request, processing HTML, and following links.
import requests
from bs4 import BeautifulSoup
import time
import random

# Configuration
url = "https://quotes.toscrape.com/"  # Replace this with your target site
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36"
}
timeout = 5       # server response timeout in seconds
max_retries = 3   # maximum number of retries on errors

# You can add your proxy here if needed, for example:
# proxies = {
#     "http": "http://username:password@proxyserver:port",
#     "https": "https://username:password@proxyserver:port"
# }
proxies = None  # None means requests connects directly, without a proxy

# Function to check access via robots.txt
def can_crawl(base_url, path="/"):
    try:
        robots_url = base_url.rstrip("/") + "/robots.txt"
        r = requests.get(robots_url, headers=headers, timeout=timeout)
        if r.status_code == 200 and f"Disallow: {path}" in r.text:
            print(f"Path {path} is disallowed by robots.txt")
            return False
    except requests.RequestException:
        # If robots.txt is unavailable, continue
        pass
    return True

# Main logic
if can_crawl(url):
    for attempt in range(max_retries):
        try:
            response = requests.get(url, headers=headers, timeout=timeout, proxies=proxies)
            response.raise_for_status()
            soup = BeautifulSoup(response.text, 'lxml')

            # Collect links
            links = [a['href'] for a in soup.find_all('a', href=True)]  # You can change the selector here
            print("Found links:", links)

            # Random delay between requests to simulate more realistic behavior
            time.sleep(random.uniform(3, 7))  # better than a fixed 5-second delay

            break  # if everything succeeds, exit the retry loop
        except requests.RequestException as e:
            print(f"Request error (attempt {attempt + 1}): {e}")
            wait = 2 ** attempt  # exponential backoff
            print(f"Waiting {wait} seconds before retry...")
            time.sleep(wait)
else:
    print("The crawler cannot process this resource due to robots.txt rules")
This script shows the basic workflow: making a request, parsing HTML, and collecting links.
For multi-page sites, you need a loop that walks through all pages. Example:
for page in range(1, 6):
    url = f"https://quotes.toscrape.com/page/{page}/"
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'lxml')
    # data processing
Responsible crawling includes checking a site’s robots.txt file and following its rules. You also need to introduce delays between requests so you don’t overload the server. With the time.sleep() function, you can add pauses between page fetches.
import time
import requests
from bs4 import BeautifulSoup

for page in range(1, 6):
    url = f"https://quotes.toscrape.com/page/{page}/"
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'lxml')
    # data processing
    time.sleep(5)  # delay in seconds
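A more robust way to check robots.txt than matching strings by hand is Python's built-in urllib.robotparser module. A small sketch, again using the demo site from the earlier examples:
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://quotes.toscrape.com/robots.txt")
rp.read()  # download and parse the robots.txt file

page_url = "https://quotes.toscrape.com/page/2/"
if rp.can_fetch("*", page_url):  # "*" means "any user agent"
    print("Allowed to crawl:", page_url)
else:
    print("Disallowed by robots.txt:", page_url)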
You can store the collected data in convenient formats like CSV or JSON. For example, to save the list of links:
import json
data = {"links": links}
with open("links.json", "w", encoding="utf-8") as f:
    json.dump(data, f, ensure_ascii=False, indent=4)
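If you prefer CSV, the standard csv module works just as well. A small sketch that writes the same list of links, one per row:
import csv

with open("links.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["link"])  # header row
    for link in links:
        writer.writerow([link])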
By following these steps, you end up with a basic web crawler you can extend for more advanced tasks. You can scale the code, integrate proxy support, handle large numbers of pages, or move to more powerful frameworks like Scrapy for complex data collection scenarios.
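As a rough idea of where that can lead, here is a minimal Scrapy spider sketch for the same demo site (the spider name, CSS selectors, and output fields are assumptions tied to that site, not a drop-in replacement for the script above):
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"  # arbitrary spider name
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # Extract structured data from the current page
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        # Follow the pagination link, if there is one
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
Saved in a file such as quotes_spider.py, it can be run with scrapy runspider quotes_spider.py -o quotes.json, and Scrapy takes care of request scheduling, retries, and throttling for you.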