For direct outreach to work, you need a solid foundation – a database of real, up-to-date email addresses. That’s where email scraping with Python comes in: a way to programmatically collect addresses from websites.
In this guide, we'll look at how to build an email scraper with Python from scratch, how to handle dynamic pages, how to filter and validate the addresses you collect, and how to use the resulting data in real marketing or business workflows.
This material is useful if you need to build a database of public contact addresses for outreach, marketing, research, or automating routine data collection.
Next, we’ll see how to turn publicly available pages into a direct-communication channel with people who may become your customers – using Python.
At its core, such scraping is about automatically scanning HTML or dynamic pages and looking in the content or attributes for patterns that match address formats (for example, username@domain.tld). Then you filter, validate, and save the results.
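To make the core idea concrete, here is a minimal sketch (the sample text and addresses are purely illustrative) of matching address-like patterns with a regular expression:

import re

# Match username@domain.tld style patterns in arbitrary text
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

text = "Questions? Write to sales@example.com or support@example.org."
print(EMAIL_RE.findall(text))
# ['sales@example.com', 'support@example.org']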
It is widely used in business, marketing, research, and automating routine processes. It’s particularly useful when you need to gather and structure a large volume of public information from multiple sources.
Email scraping with Python is applied to tasks such as building outreach databases, collecting public contact details from company websites, and structuring contact information gathered from many sources for marketing or research.
If you’re interested in gathering contact data for e-commerce projects, explore our guide on Ecommerce data scraping.
To make scraping effective, you need to prepare the environment and choose the right tools. They help you retrieve data faster, handle complex or dynamic pages, and organize larger projects.
Common Python tools for scraping:
| Tool | Use |
| --- | --- |
| requests / httpx | Fetching static pages |
| BeautifulSoup | HTML parsing / element search |
| re (regular expressions) | Extracting patterns |
| lxml | Faster parsing |
| Selenium / Playwright | Handling JavaScript-driven pages |
| Scrapy | A full-scale framework for large crawls |
pip install requests beautifulsoup4 lxml
pip install selenium # if you need dynamic rendering
To see how similar methods are applied to other platforms, check our detailed guide on how to scrape Reddit using Python.
# High-level outline of scraping a single page (the wrapper name is illustrative)
def scrape_emails_from_page(url: str) -> set[str]:
    # 1. Create an HTTP session with timeouts and retries
    session = make_session()
    # 2. Load the page and take its HTML body
    html = session.get(url, timeout=10).text
    # 3. Look for email addresses:
    #    - via regex across the entire text
    #    - via mailto: links in HTML
    emails = extract_emails_from_text(html)
    emails.update(find_mailto_links(html))
    # 4. Return a unique set of addresses
    return emails
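The outline calls two helpers, extract_emails_from_text() and find_mailto_links(), that are only named above. A minimal sketch of what they might look like (the bodies are assumptions, not code from the full script below):

import re
from bs4 import BeautifulSoup

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def extract_emails_from_text(html: str) -> set[str]:
    """Find address-like substrings anywhere in the raw HTML/text."""
    return set(EMAIL_RE.findall(html))

def find_mailto_links(html: str) -> set[str]:
    """Pull addresses out of <a href="mailto:..."> links."""
    emails: set[str] = set()
    soup = BeautifulSoup(html, "lxml")
    for a in soup.find_all("a", href=True):
        href = a["href"].strip()
        if href.lower().startswith("mailto:"):
            addr = href[7:].split("?", 1)[0].strip()
            if EMAIL_RE.fullmatch(addr):
                emails.add(addr.lower())
    return emails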
"""
Iterate over internal links within one domain and collect email addresses.
Highlights:
- Page limit (max_pages) to stop safely
- Verifying that a link belongs to the base domain
- Avoiding re-visits
- Optional respect for robots.txt
"""
from __future__ import annotations
from collections import deque
from typing import Set
from urllib.parse import urljoin, urlparse, urlsplit, urlunsplit
import time
import requests
from bs4 import BeautifulSoup
import lxml  # imported only so a missing lxml install fails early; BeautifulSoup uses it as its parser
from urllib import robotparser # standard robots.txt parser
# The helpers from the previous block, make_session() and scrape_emails_from_url(),
# are re-defined below so this script runs on its own.
import re
# General regular expression for email addresses
EMAIL_RE = re.compile(r"[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9.-]+")
def scrape_emails_from_url(url: str, session: requests.Session) -> Set[str]:
"""Collect email addresses from the given URL page."""
emails: Set[str] = set()
try:
resp = session.get(url, timeout=getattr(session, "_default_timeout", 10.0))
resp.raise_for_status()
# Regular expression for email addresses
# Note: this regex isn't perfect, but it's sufficient for typical cases
email_pattern = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")
emails.update(email_pattern.findall(resp.text))
except requests.RequestException:
pass
return emails
def make_session() -> requests.Session:
"""Create and return a requests session with basic settings."""
session = requests.Session()
session.headers.update({
"User-Agent": "EmailScraper/1.0",
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
"Accept-Language": "en-US,en;q=0.9",
# Don't force Accept-Encoding to avoid br issues without brotli
"Connection": "keep-alive",
})
return session
def same_host(url: str, base_netloc: str) -> bool:
"""True if the link belongs to the same host (domain/subdomain)."""
return urlparse(url).netloc == base_netloc
def load_robots(start_url: str, user_agent: str = "EmailScraper") -> robotparser.RobotFileParser:
"""Read robots.txt and return a parser for permission checks."""
base = urlparse(start_url)
robots_url = f"{base.scheme}://{base.netloc}/robots.txt"
rp = robotparser.RobotFileParser()
rp.set_url(robots_url)
try:
rp.read()
except Exception:
pass
    # Note: RobotFileParser has no user-agent setting; can_fetch() receives the user agent per call
return rp
def normalize_url(url: str, base: str | None = None) -> str | None:
    """Resolve a possibly relative link to an absolute http(s) URL and strip its fragment."""
    try:
abs_url = urljoin(base, url) if base else url
parts = urlsplit(abs_url)
if parts.scheme not in ("http", "https"):
return None
host = parts.hostname
if not host:
return None
host = host.lower()
netloc = host
if parts.port:
netloc = f"{host}:{parts.port}"
parts = parts._replace(fragment="")
return urlunsplit((parts.scheme.lower(), netloc, parts.path or "/", parts.query, ""))
except Exception:
return None
def in_scope(url: str, base_host: str, include_subdomains: bool) -> bool:
    """True if the URL's host matches the base host (optionally including its subdomains)."""
    try:
host = urlsplit(url).hostname
if not host:
return False
host = host.lower()
base_host = (base_host or "").lower()
if include_subdomains:
return host == base_host or host.endswith("." + base_host)
else:
return host == base_host
except Exception:
return False
def collect_emails_from_site(
start_url: str,
max_pages: int = 100,
delay_sec: float = 0.5,
respect_robots: bool = True,
include_subdomains: bool = True,
) -> Set[str]:
"""
Traverse pages within a domain and return unique email addresses.
- max_pages: hard limit on visited pages.
- delay_sec: polite pause between requests.
- respect_robots: if True — checks access rules.
- include_subdomains: if True — allows subdomains (www, etc.).
"""
session = make_session()
base_host = (urlparse(start_url).netloc or "").lower()
visited: Set[str] = set()
queue: deque[str] = deque()
enqueued: Set[str] = set()
all_emails: Set[str] = set()
start_norm = normalize_url(start_url)
if start_norm:
queue.append(start_norm)
enqueued.add(start_norm)
rp = load_robots(start_url, user_agent="EmailScraper/1.0") if respect_robots else None
while queue and len(visited) < max_pages:
url = queue.popleft()
if url in visited:
continue
# robots.txt check
if respect_robots and rp is not None:
try:
if not rp.can_fetch("EmailScraper/1.0", url):
continue
except Exception:
pass
# One request: used both for emails and links
try:
resp = session.get(url, timeout=10)
resp.raise_for_status()
html_text = resp.text or ""
except requests.RequestException:
continue
visited.add(url)
# Skip non-HTML pages
ctype = resp.headers.get("Content-Type", "")
if ctype and "text/html" not in ctype:
continue
# Collect emails
for m in EMAIL_RE.findall(html_text):
all_emails.add(m.lower())
# Parse links
soup = BeautifulSoup(html_text, "lxml")
# Emails from mailto:
for a in soup.find_all("a", href=True):
href = a["href"].strip()
if href.lower().startswith("mailto:"):
addr_part = href[7:].split("?", 1)[0]
for piece in addr_part.split(","):
email = piece.strip()
if EMAIL_RE.fullmatch(email):
all_emails.add(email.lower())
for a in soup.find_all("a", href=True):
href = a["href"].strip()
if not href or href.startswith(("javascript:", "mailto:", "tel:", "data:")):
continue
next_url = normalize_url(href, base=url)
if not next_url:
continue
if not in_scope(next_url, base_host, include_subdomains):
continue
if next_url not in visited and next_url not in enqueued:
queue.append(next_url)
enqueued.add(next_url)
if delay_sec > 0:
time.sleep(delay_sec)
try:
session.close()
except Exception:
pass
return all_emails
if __name__ == "__main__":
import argparse
parser = argparse.ArgumentParser(
description="An email scraper that traverses pages within a site and prints discovered addresses."
)
parser.add_argument(
"start_url",
help="Starting URL, for example: https://example.com"
)
parser.add_argument(
"--max-pages",
type=int,
default=100,
dest="max_pages",
help="Maximum number of pages to traverse (default: 100)"
)
parser.add_argument(
"--delay",
type=float,
default=0.5,
help="Delay between requests in seconds (default: 0.5)"
)
parser.add_argument(
"--no-robots",
action="store_true",
help="Ignore robots.txt (use carefully)"
)
scope = parser.add_mutually_exclusive_group()
scope.add_argument(
"--include-subdomains",
dest="include_subdomains",
action="store_true",
default=True,
help="Include subdomains (default)"
)
scope.add_argument(
"--exact-host",
dest="include_subdomains",
action="store_false",
help="Restrict traversal to the exact host (no subdomains)"
)
parser.add_argument(
    "--output",
    type=str,
    default=None,
    help="Optional: path to a file to save found email addresses (one per line)",
)
args = parser.parse_args()
emails = collect_emails_from_site(
args.start_url,
max_pages=args.max_pages,
delay_sec=args.delay,
respect_robots=not args.no_robots,
include_subdomains=args.include_subdomains,
)
for e in sorted(emails):
print(e)
print(f"Found {len(emails)} unique emails.")
if args.output:
try:
with open(args.output, "w", encoding="utf-8") as f:
for e in sorted(emails):
f.write(e + "\n")
except Exception as ex:
print(f"Could not write the output file: {ex}")
python main.py https://example.com
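The flags defined in the script can be combined. For example, to crawl at most 50 pages with a one-second delay, stay on the exact host, and save the results to a file:

python main.py https://example.com --max-pages 50 --delay 1.0 --exact-host --output emails.txt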
When you run the script against real sites, things aren't always straightforward: many pages deliberately hide email addresses or only expose them after JavaScript renders. Here's what can get in the way and how to handle it.
1. Obfuscation
Sites often hide addresses from bots by writing them as "name [at] domain [dot] com", assembling them with JavaScript at runtime, encoding them (for example in Base64), or rendering them as images.
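For the simplest text-level tricks, a small normalization pass before applying the email regex is usually enough. A sketch (the replacement patterns are assumptions and will need tuning per site):

import re

def deobfuscate(text: str) -> str:
    # Turn "name [at] domain [dot] com" back into name@domain.com
    text = re.sub(r"\s*[\[\(]\s*at\s*[\)\]]\s*", "@", text, flags=re.I)
    text = re.sub(r"\s*[\[\(]\s*dot\s*[\)\]]\s*", ".", text, flags=re.I)
    return text

print(deobfuscate("sales [at] example [dot] com"))  # sales@example.com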
2. Dynamic pages
Modern sites frequently load content via JavaScript (e.g., fetch, AJAX). A plain requests.get() can return an “empty” HTML shell without the email content.
Practical approaches when you encounter such pages:
Render the page in a real browser (Selenium or Playwright): launch a browser, let the page load, wait for the required elements, then capture the full HTML. This works when the email is injected by JavaScript after render.
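A minimal sketch with Selenium (already listed in the tools table). The URL is a placeholder, and a recent Selenium with a headless Chrome is assumed:

import re
import time
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

options = Options()
options.add_argument("--headless=new")      # render without opening a window
driver = webdriver.Chrome(options=options)  # Selenium 4 can manage the driver itself
try:
    driver.get("https://example.com/contact")
    time.sleep(3)                           # crude pause so client-side JS can inject content
    html = driver.page_source               # HTML after JavaScript has run
    print(EMAIL_RE.findall(html))
finally:
    driver.quit()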
Call the underlying API directly: often the page really does pull data from an API. Check the network requests (DevTools → Network) to see whether one of them returns the email or contact info as JSON. If so, it's usually better to use that API directly.
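A hypothetical example of that approach; the endpoint URL and the JSON field name below are assumptions you would replace with what DevTools actually shows:

import requests

resp = requests.get("https://example.com/api/contacts", timeout=10)
resp.raise_for_status()
for item in resp.json():          # assume the endpoint returns a list of contact records
    email = item.get("email")
    if email:
        print(email)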
Decode obfuscated JavaScript: sometimes the address is embedded in the page's JavaScript (e.g., as a Base64 string or split into parts). You can inspect that JS, extract the string, and decode the address.
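For example, a common pattern is a Base64 string passed to atob() in inline JavaScript. A sketch of extracting and decoding it (the HTML snippet is made up for illustration):

import base64
import re

html = "<a href=\"#\" onclick=\"mail(atob('aW5mb0BleGFtcGxlLmNvbQ=='))\">Write to us</a>"
for encoded in re.findall(r"atob\('([A-Za-z0-9+/=]+)'\)", html):
    try:
        decoded = base64.b64decode(encoded).decode("utf-8")
    except Exception:
        continue
    if "@" in decoded:
        print(decoded)  # info@example.com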
Apply OCR to images: download the image and run OCR (Optical Character Recognition) on it, for example with Tesseract. This is more resource-intensive but sometimes necessary.
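A sketch using Tesseract through the pytesseract and Pillow packages; the image URL is a placeholder, and the Tesseract binary itself must be installed separately:

import io
import requests
from PIL import Image
import pytesseract

resp = requests.get("https://example.com/images/contact-email.png", timeout=10)
resp.raise_for_status()
image = Image.open(io.BytesIO(resp.content))
text = pytesseract.image_to_string(image)   # run OCR on the downloaded image
print(text.strip())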
Some elements appear only after a few seconds or after specific events (scroll, click). In that case it makes sense to trigger the interaction, wait explicitly for the target element, and only then read the page HTML, as in the sketch below.
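A sketch of an explicit wait with Selenium; the CSS selector is an assumption standing in for whatever element holds the contact details:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
try:
    driver.get("https://example.com/contact")
    # Scroll to the bottom to trigger lazy-loaded content
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    # Wait up to 10 seconds for the contact block to appear
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, ".contact-email"))
    )
    html = driver.page_source
finally:
    driver.quit()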
By applying the techniques discussed in this article for email scraping with Python, you can make your scripts work reliably in real-world conditions. Keep in mind that data quality directly affects the effectiveness of subsequent campaigns, so it’s worth implementing filtering, validation, and saving to a convenient format from the start.
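As a starting point, here is a minimal post-processing sketch (the sample set and the blocklist of file-like suffixes are illustrative) that drops common regex false positives and writes the rest to CSV:

import csv

emails = {"Sales@Example.com", "logo@2x.png", "info@example.com"}
BAD_SUFFIXES = (".png", ".jpg", ".jpeg", ".gif", ".webp", ".svg")

cleaned = sorted(
    e.strip().strip(".").lower()
    for e in emails
    if not e.lower().endswith(BAD_SUFFIXES)   # e.g. "logo@2x.png" is not an address
)

with open("emails.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["email"])
    for e in cleaned:
        writer.writerow([e])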