For direct outreach to pay off, you need a solid foundation: a database of real, up-to-date email addresses. That is where email scraping with Python comes in: a programmatic way to collect addresses from websites.
In this guide, we will look at how to build an email scraper in Python from scratch, how to handle dynamic pages, how to filter and verify the addresses you collect, and how to use that data in real marketing or business workflows.
This material is useful if you want to:
Then we will see how to turn public pages into a direct communication channel with potential customers, using Python.
Essentially, this kind of scraping means automatically scanning HTML or dynamic pages and looking for patterns in the content or attributes that match email address formats (for example, username@domain.tld). The results are then filtered, verified, and saved.
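As a minimal sketch of that idea (the sample HTML string and the pattern below are illustrative, not exhaustive), a regular expression can pull address-like strings out of raw page text:

```python
import re

# A deliberately simple username@domain.tld pattern; stricter filtering comes later
EMAIL_RE = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")

sample_html = '<p>Write to <a href="mailto:sales@example.com">sales@example.com</a></p>'
print(set(EMAIL_RE.findall(sample_html)))  # {'sales@example.com'}
```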
It is widely used in business, marketing, research, and for automating routine processes. It is especially useful when you need to collect and structure a large amount of public information from many sources.
Examples of specific tasks where email scraping with Python is used:
If you want to collect contact data for e-commerce projects, take a look at our guide on Ecommerce data scraping.
To scrape effectively, you need to set up your environment and choose the right tools. They help you fetch data faster, handle complex or dynamic pages, and organize large projects.
Popular Python tools for scraping:
| Tool | Purpose |
|---|---|
| requests / httpx | Fetching static pages |
| BeautifulSoup | Parsing HTML / finding elements |
| re (regular expressions) | Extracting patterns |
| lxml | Faster parsing |
| Selenium / Playwright | Handling pages that run JavaScript |
| Scrapy | A full framework for large crawls |
pip install requests beautifulsoup4 lxml
pip install selenium  # if dynamic rendering is needed

To see how similar methods are applied to other platforms, check out our detailed guide on scrape Reddit using Python.
def scrape_emails_from_url(url: str) -> set[str]:
    # 1. Create an HTTP session with timeouts and retries
    session = make_session()
    # 2. Load the page
    html = session.get(url).text
    # 3. Look for email addresses:
    #    - via regex across the entire text
    #    - via mailto: links in HTML
    emails = extract_emails_from_text(html)
    emails.update(find_mailto_links(html))
    # 4. Return a unique set of addresses
    return emails
"""
Iterate over internal links within one domain and collect email addresses.
Highlights:
- Page limit (max_pages) to stop safely
- Verifying that a link belongs to the base domain
- Avoiding re-visits
- Optional respect for robots.txt
"""
from __future__ import annotations
from collections import deque
from typing import Set
from urllib.parse import urljoin, urlparse, urlsplit, urlunsplit
import time
import requests
from bs4 import BeautifulSoup
import lxml # Import lxml to ensure it's available for BeautifulSoup
from urllib import robotparser # standard robots.txt parser
# make_session() and scrape_emails_from_url() from the previous section
# are redefined below so that this file runs on its own.
import re
# General regular expression for email addresses
EMAIL_RE = re.compile(r"[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9.-]+")
def scrape_emails_from_url(url: str, session: requests.Session) -> Set[str]:
"""Collect email addresses from the given URL page."""
emails: Set[str] = set()
try:
resp = session.get(url, timeout=getattr(session, "_default_timeout", 10.0))
resp.raise_for_status()
# Regular expression for email addresses
# Note: this regex isn't perfect, but it's sufficient for typical cases
email_pattern = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")
emails.update(email_pattern.findall(resp.text))
except requests.RequestException:
pass
return emails
def make_session() -> requests.Session:
"""Create and return a requests session with basic settings."""
session = requests.Session()
session.headers.update({
"User-Agent": "EmailScraper/1.0",
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
"Accept-Language": "en-US,en;q=0.9",
# Don't force Accept-Encoding to avoid br issues without brotli
"Connection": "keep-alive",
})
return session
def same_host(url: str, base_netloc: str) -> bool:
"""True if the link belongs to the same host (domain/subdomain)."""
return urlparse(url).netloc == base_netloc
def load_robots(start_url: str, user_agent: str = "EmailScraper") -> robotparser.RobotFileParser:
"""Read robots.txt and return a parser for permission checks."""
base = urlparse(start_url)
robots_url = f"{base.scheme}://{base.netloc}/robots.txt"
rp = robotparser.RobotFileParser()
rp.set_url(robots_url)
try:
rp.read()
except Exception:
pass
rp.useragent = user_agent
return rp
def normalize_url(url: str, base: str | None = None) -> str | None:
try:
abs_url = urljoin(base, url) if base else url
parts = urlsplit(abs_url)
if parts.scheme not in ("http", "https"):
return None
host = parts.hostname
if not host:
return None
host = host.lower()
netloc = host
if parts.port:
netloc = f"{host}:{parts.port}"
parts = parts._replace(fragment="")
return urlunsplit((parts.scheme.lower(), netloc, parts.path or "/", parts.query, ""))
except Exception:
return None
def in_scope(url: str, base_host: str, include_subdomains: bool) -> bool:
try:
host = urlsplit(url).hostname
if not host:
return False
host = host.lower()
base_host = (base_host or "").lower()
if include_subdomains:
return host == base_host or host.endswith("." + base_host)
else:
return host == base_host
except Exception:
return False
def collect_emails_from_site(
start_url: str,
max_pages: int = 100,
delay_sec: float = 0.5,
respect_robots: bool = True,
include_subdomains: bool = True,
) -> Set[str]:
"""
Traverse pages within a domain and return unique email addresses.
- max_pages: hard limit on visited pages.
- delay_sec: polite pause between requests.
- respect_robots: if True — checks access rules.
- include_subdomains: if True — allows subdomains (www, etc.).
"""
session = make_session()
base_host = (urlparse(start_url).netloc or "").lower()
visited: Set[str] = set()
queue: deque[str] = deque()
enqueued: Set[str] = set()
all_emails: Set[str] = set()
start_norm = normalize_url(start_url)
if start_norm:
queue.append(start_norm)
enqueued.add(start_norm)
rp = load_robots(start_url, user_agent="EmailScraper/1.0") if respect_robots else None
while queue and len(visited) < max_pages:
url = queue.popleft()
if url in visited:
continue
# robots.txt check
if respect_robots and rp is not None:
try:
if not rp.can_fetch("EmailScraper/1.0", url):
continue
except Exception:
pass
# One request: used both for emails and links
try:
resp = session.get(url, timeout=10)
resp.raise_for_status()
html_text = resp.text or ""
except requests.RequestException:
continue
visited.add(url)
# Skip non-HTML pages
ctype = resp.headers.get("Content-Type", "")
if ctype and "text/html" not in ctype:
continue
# Collect emails
for m in EMAIL_RE.findall(html_text):
all_emails.add(m.lower())
# Parse links
soup = BeautifulSoup(html_text, "lxml")
# Emails from mailto:
for a in soup.find_all("a", href=True):
href = a["href"].strip()
if href.lower().startswith("mailto:"):
addr_part = href[7:].split("?", 1)[0]
for piece in addr_part.split(","):
email = piece.strip()
if EMAIL_RE.fullmatch(email):
all_emails.add(email.lower())
for a in soup.find_all("a", href=True):
href = a["href"].strip()
if not href or href.startswith(("javascript:", "mailto:", "tel:", "data:")):
continue
next_url = normalize_url(href, base=url)
if not next_url:
continue
if not in_scope(next_url, base_host, include_subdomains):
continue
if next_url not in visited and next_url not in enqueued:
queue.append(next_url)
enqueued.add(next_url)
if delay_sec > 0:
time.sleep(delay_sec)
try:
session.close()
except Exception:
pass
return all_emails
if __name__ == "__main__":
import argparse
parser = argparse.ArgumentParser(
description="An email scraper that traverses pages within a site and prints discovered addresses."
)
parser.add_argument(
"start_url",
help="Starting URL, for example: https://example.com"
)
parser.add_argument(
"--max-pages",
type=int,
default=100,
dest="max_pages",
help="Maximum number of pages to traverse (default: 100)"
)
parser.add_argument(
"--delay",
type=float,
default=0.5,
help="Delay between requests in seconds (default: 0.5)"
)
parser.add_argument(
"--no-robots",
action="store_true",
help="Ignore robots.txt (use carefully)"
)
scope = parser.add_mutually_exclusive_group()
scope.add_argument(
"--include-subdomains",
dest="include_subdomains",
action="store_true",
default=True,
help="Include subdomains (default)"
)
scope.add_argument(
"--exact-host",
dest="include_subdomains",
action="store_false",
help="Restrict traversal to the exact host (no subdomains)"
)
parser.add_argument(
"--output",
type=str,
default=None,
help="Optional: path to a file to save found email addresses (one per line)"
args = parser.parse_args()
emails = collect_emails_from_site(
args.start_url,
max_pages=args.max_pages,
delay_sec=args.delay,
respect_robots=not args.no_robots,
include_subdomains=args.include_subdomains,
)
for e in sorted(emails):
print(e)
print(f"Found {len(emails)} unique emails.")
if args.output:
try:
with open(args.output, "w", encoding="utf-8") as f:
for e in sorted(emails):
f.write(e + "\n")
except Exception as ex:
print(f"Could not write the output file: {ex}")
python main.py https://example.com
When you run the script, things are not always straightforward: many sites deliberately hide email addresses or only reveal them after JavaScript has rendered. Here is what can get in your way, and how to handle it.
1. Obfuscation
Sites often use techniques to hide addresses from bots:
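As one illustration, here is a small sketch that normalizes the common `name [at] domain [dot] com` spelling back into a plain address before the usual regex is applied (the replacement rules assume exactly this obfuscation style):

```python
import re

def deobfuscate(text: str) -> str:
    """Rewrite common '[at]' / '(dot)' spellings back into a plain address form."""
    text = re.sub(r"\s*[\[(]\s*at\s*[\])]\s*", "@", text, flags=re.IGNORECASE)
    text = re.sub(r"\s*[\[(]\s*dot\s*[\])]\s*", ".", text, flags=re.IGNORECASE)
    return text

print(deobfuscate("info [at] example [dot] com"))  # info@example.com
```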
2. Dynamic pages
Modern sites often load content via JavaScript (e.g. fetch, AJAX). A plain requests.get() can return an empty HTML "shell" with no email content in it.
Practical approaches when dealing with pages like this:
Launch a browser, let the page load, wait until the required elements appear, then capture the full HTML, as in the sketch below. This works when the email is injected by JS after rendering.
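A minimal sketch of that approach with Selenium (assuming Chrome is installed; the URL and the element being waited for are placeholders):

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

options = webdriver.ChromeOptions()
options.add_argument("--headless=new")  # run without opening a browser window
driver = webdriver.Chrome(options=options)
try:
    driver.get("https://example.com/contact")
    # Wait until link elements exist, i.e. the JS-rendered content has appeared
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.TAG_NAME, "a"))
    )
    html = driver.page_source  # full rendered HTML, ready for EMAIL_RE / mailto extraction
finally:
    driver.quit()
```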
Often the page actually pulls its data from an API. Check the network requests (DevTools → Network) to see whether there is a call that returns the email or contact details as JSON. If so, it is better to use that API directly.
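Whether such an endpoint exists, and what it returns, is entirely site-specific; the URL and the `email` field in this sketch are hypothetical and only show the shape of the approach:

```python
import requests

# Hypothetical JSON endpoint spotted in DevTools -> Network
resp = requests.get("https://example.com/api/v1/contacts", timeout=10)
resp.raise_for_status()

emails = set()
for item in resp.json():        # assumes the endpoint returns a list of objects
    email = item.get("email")   # 'email' is an assumed field name
    if email:
        emails.add(email.lower())
print(emails)
```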
Sometimes the address is "embedded" in JavaScript (e.g. as a Base64 string or split into pieces). You can read that JS, extract the string, and decode the address.
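For the Base64 case, once the encoded string has been pulled out of the script, decoding it takes one call (the value below is just info@example.com encoded as an example):

```python
import base64

# String extracted from the page's JavaScript
encoded = "aW5mb0BleGFtcGxlLmNvbQ=="
email = base64.b64decode(encoded).decode("utf-8")
print(email)  # info@example.com
```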
Download the image and apply OCR (Optical Character Recognition), for example with Tesseract. This is more resource-intensive, but sometimes necessary.
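A sketch of that OCR step with pytesseract (it assumes the pytesseract package and the Tesseract binary are installed, and that contact.png is an image you have already downloaded):

```python
import re

from PIL import Image
import pytesseract

EMAIL_RE = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")

# contact.png is a locally saved image that shows the address as pixels
text = pytesseract.image_to_string(Image.open("contact.png"))
print(EMAIL_RE.findall(text))
```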
Some elements only appear after a few seconds or after certain events (scrolling, clicking). It makes sense to:
By applying the techniques discussed in this article for email scraping with Python, you can make sure your scripts keep working reliably under real-world conditions. Keep in mind that data quality directly affects the effectiveness of any follow-up campaigns, so it is worth building in filtering, verification, and saving to a convenient format from the start.
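As a closing sketch of that filter-verify-save step (the junk-suffix list and the output file name are only examples):

```python
import csv
import re

EMAIL_RE = re.compile(r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$")
JUNK_SUFFIXES = (".png", ".jpg", ".jpeg", ".gif", ".svg", ".webp")  # common regex false positives

def clean_emails(raw: set) -> list:
    """Deduplicate, lowercase, drop obvious false positives, and sort."""
    cleaned = set()
    for email in raw:
        email = email.strip().lower()
        if EMAIL_RE.match(email) and not email.endswith(JUNK_SUFFIXES):
            cleaned.add(email)
    return sorted(cleaned)

def save_emails_csv(emails: list, path: str = "emails.csv") -> None:
    """Write the cleaned list to a CSV file with a single 'email' column."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["email"])
        for email in emails:
            writer.writerow([email])

save_emails_csv(clean_emails({"Sales@Example.com", "icon@2x.png", "sales@example.com"}))
```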