Scraping Reddit provides a wealth of information about trending topics, community engagement, and popular posts. Although Reddit's official API is a common tool for accessing such content, it has limitations that scraping can overcome by offering more flexibility in data selection. This tutorial walks you through using the asynchronous Playwright library to handle dynamic content and the lxml library to extract the data, giving a complete approach to scraping Reddit.
Before you start, make sure Python is installed along with the required libraries:
pip install playwright
pip install lxml
After installing the required libraries, you will need to install the Playwright browser binaries:
playwright install
To install only the Chromium browser, use the following command:
playwright install chromium
These tools will help us interact with Reddit's dynamic content, parse the HTML, and extract the required data.
Playwright is a powerful tool that lets us control a browser and interact with web pages as a human user would. We will use it to load the Reddit page and get its HTML content.
Here is the async Playwright code to load the Reddit page:
import asyncio
from playwright.async_api import async_playwright

async def fetch_page_content():
    async with async_playwright() as playwright:
        # Launch a visible Chromium window (headless=False) so the page load can be watched
        browser = await playwright.chromium.launch(headless=False)
        context = await browser.new_context()
        page = await context.new_page()
        await page.goto("https://www.reddit.com/r/technology/top/?t=week")
        # Read the rendered HTML before closing the browser
        page_content = await page.content()
        await browser.close()
        return page_content

# Fetch the page content
page_content = asyncio.run(fetch_page_content())
While scraping, you may run into issues such as rate limiting or IP blocking. To mitigate these, you can use proxies to rotate your IP address, which helps avoid detection, and custom headers to mimic real user behavior. Your proxy service provider can handle the rotation, managing a pool of IPs and rotating them as needed.
async def fetch_page_content_with_proxy():
    async with async_playwright() as playwright:
        browser = await playwright.chromium.launch(headless=True, proxy={
            "server": "http://proxy-server:port",
            "username": "your-username",
            "password": "your-password"
        })
        context = await browser.new_context()
        page = await context.new_page()
        await page.goto("https://www.reddit.com/r/technology/top/?t=week", wait_until='networkidle')
        page_content = await page.content()
        await browser.close()
        return page_content
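The custom headers mentioned above can be set on the browser context. The following is a minimal sketch, not a definitive setup: `build_context_options` and `fetch_page_content_with_headers` are hypothetical helper names, and the user agent and `Accept-Language` values are placeholder assumptions you would replace with values matching a real browser. It relies on the `user_agent` and `extra_http_headers` options of Playwright's `browser.new_context()`.

```python
def build_context_options(user_agent: str, accept_language: str = "en-US,en;q=0.9") -> dict:
    """Return keyword arguments for browser.new_context() (hypothetical helper)."""
    return {
        # Overrides the User-Agent header sent by the browser context
        "user_agent": user_agent,
        # Extra headers attached to every request made from the context
        "extra_http_headers": {"Accept-Language": accept_language},
    }


async def fetch_page_content_with_headers(url: str, user_agent: str) -> str:
    # Imported inside the function so build_context_options() can be used
    # and inspected even where Playwright is not installed.
    from playwright.async_api import async_playwright

    async with async_playwright() as playwright:
        browser = await playwright.chromium.launch(headless=True)
        context = await browser.new_context(**build_context_options(user_agent))
        page = await context.new_page()
        await page.goto(url, wait_until="networkidle")
        page_content = await page.content()
        await browser.close()
        return page_content
```

To use it, pass the coroutine to `asyncio.run`, for example `asyncio.run(fetch_page_content_with_headers("https://www.reddit.com/r/technology/top/?t=week", "your-user-agent-string"))`.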
Once we have the HTML content, the next step is to parse it and extract the relevant data using lxml.
from lxml import html

# Parse the HTML content
parser = html.fromstring(page_content)
The top posts on the r/technology subreddit are contained within article elements. These elements can be targeted with the following XPath:
# Extract the individual post elements
elements = parser.xpath('//article[@class="w-full m-0"]')
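An exact `@class="..."` comparison only matches when the attribute string is identical, so it stops matching if the classes are reordered or another class is added. As a hedged alternative, the sketch below matches individual class tokens instead. The sample markup is made up for illustration, and `class_token` and `POST_XPATH` are hypothetical names, not part of lxml.

```python
from lxml import html

# Made-up markup for illustration; a real Reddit page is far larger.
SAMPLE_HTML = """
<div>
  <article class="w-full m-0" aria-label="First post"></article>
  <article class="m-0 w-full extra" aria-label="Second post"></article>
  <article class="sidebar" aria-label="Not a post"></article>
</div>
"""


def class_token(name: str) -> str:
    """Build an XPath 1.0 test that is true when @class contains the token `name`."""
    return f"contains(concat(' ', normalize-space(@class), ' '), ' {name} ')"


# Match articles that carry both class tokens, in any order
POST_XPATH = f"//article[{class_token('w-full')} and {class_token('m-0')}]"

tree = html.fromstring(SAMPLE_HTML)
exact = tree.xpath('//article[@class="w-full m-0"]')
tolerant = tree.xpath(POST_XPATH)

print(len(exact), len(tolerant))  # prints: 1 2
```

The token-based expression is longer, but it keeps working when the class order changes; it still breaks if Reddit renames the classes themselves.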
XPath is a robust tool for navigating and selecting nodes in an HTML document. We will use it to extract the title, link, and tag from each post.
Here are the specific XPaths for each data point:
Title: @aria-label
Link: .//div[@class="relative truncate text-12 xs:text-14 font-semibold mb-xs "]/a/@href
Tag: .//span[@class="bg-tone-4 inline-block truncate max-w-full text-12 font-normal align-text-bottom text-secondary box-border px-[6px] rounded-[20px] leading-4 relative top-[-0.25rem] xs:top-[-2px] my-2xs xs:mb-sm py-0 "]/div/text()
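The Link XPath returns the raw `href` attribute. Assuming those values are site-relative paths (an assumption worth checking against the page you actually fetch), they can be turned into full URLs with the standard library. `BASE_URL`, `absolute_link`, and the sample path are illustrative.

```python
from urllib.parse import urljoin

# Base used to resolve relative links; assumed to be the site root
BASE_URL = "https://www.reddit.com"


def absolute_link(href: str) -> str:
    """Resolve a possibly relative href against BASE_URL (hypothetical helper)."""
    # urljoin leaves already-absolute URLs unchanged
    return urljoin(BASE_URL, href)


print(absolute_link("/r/technology/comments/abc123/example_post/"))
# prints: https://www.reddit.com/r/technology/comments/abc123/example_post/
```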
Now that we have targeted the elements, we can iterate over each post and extract the required data.
posts_data = []

# Iterate over each post element
for element in elements:
    # Each XPath returns a list; [0] takes the first match
    title = element.xpath('@aria-label')[0]
    link = element.xpath('.//div[@class="relative truncate text-12 xs:text-14 font-semibold mb-xs "]/a/@href')[0]
    tag = element.xpath('.//span[@class="bg-tone-4 inline-block truncate max-w-full text-12 font-normal align-text-bottom text-secondary box-border px-[6px] rounded-[20px] leading-4 relative top-[-0.25rem] xs:top-[-2px] my-2xs xs:mb-sm py-0 "]/div/text()')[0].strip()

    post_info = {
        "title": title,
        "link": link,
        "tag": tag
    }

    posts_data.append(post_info)
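Indexing an XPath result with `[0]` raises an `IndexError` whenever a post lacks one of the fields, for example a post with no tag. A minimal sketch of a more forgiving approach follows. `first_or_none` is a hypothetical helper, and the sample markup and its short XPaths are simplified stand-ins for the real Reddit selectors.

```python
from lxml import html

# Simplified, made-up markup: the second post has no tag element.
SAMPLE_HTML = """
<div>
  <article aria-label="Post with a tag">
    <a href="/r/technology/1">link</a>
    <span class="tag"> AI </span>
  </article>
  <article aria-label="Post without a tag">
    <a href="/r/technology/2">link</a>
  </article>
</div>
"""


def first_or_none(element, xpath: str):
    """Return the first string match, stripped, or None when nothing matches."""
    results = element.xpath(xpath)
    if not results:
        return None
    return results[0].strip()


tree = html.fromstring(SAMPLE_HTML)
posts = []
for article in tree.xpath("//article"):
    posts.append({
        "title": first_or_none(article, "@aria-label"),
        "link": first_or_none(article, ".//a/@href"),
        "tag": first_or_none(article, './/span[@class="tag"]/text()'),
    })

print(posts[1]["tag"])  # prints: None
```

Missing fields become `None` here, which `json.dump` writes as `null`; whether to keep or skip such posts is a design choice for your own pipeline.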
After extracting the data, we need to save it in a structured format. JSON is a widely used format for this purpose.
import json

# Save the data to a JSON file
with open('reddit_posts.json', 'w') as f:
    json.dump(posts_data, f, indent=4)

print("Data extraction complete. Saved to reddit_posts.json")
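Post titles often contain non-ASCII characters. By default `json.dump` escapes them as `\uXXXX` sequences, and `open()` without an `encoding` argument uses a platform-dependent default. The sketch below, with the hypothetical helpers `save_posts` and `load_posts` and made-up sample data, writes readable UTF-8 and reads the file back to check the round trip.

```python
import json
import os
import tempfile


def save_posts(posts: list, path: str) -> None:
    """Write posts as UTF-8 JSON, keeping non-ASCII characters readable."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(posts, f, indent=4, ensure_ascii=False)


def load_posts(path: str) -> list:
    """Read posts back from a UTF-8 JSON file."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


# Made-up sample record for illustration
sample_posts = [{"title": "Café öffnet: 中文 headline", "link": "/r/technology/1", "tag": "Business"}]

# A temporary directory keeps the demo from touching real files
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "reddit_posts.json")
    save_posts(sample_posts, path)
    restored = load_posts(path)
    with open(path, encoding="utf-8") as f:
        raw_text = f.read()

print(restored == sample_posts)  # prints: True
```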
Here is the complete code for scraping Reddit's top posts from r/technology and saving the data as JSON:
import asyncio
from playwright.async_api import async_playwright
from lxml import html
import json

async def fetch_page_content():
    async with async_playwright() as playwright:
        browser = await playwright.chromium.launch(headless=True, proxy={
            "server": "IP:port",
            "username": "your-username",
            "password": "your-password"
        })
        context = await browser.new_context()
        page = await context.new_page()
        await page.goto("https://www.reddit.com/r/technology/top/?t=week", wait_until='networkidle')
        page_content = await page.content()
        await browser.close()
        return page_content

# Fetch the page content
page_content = asyncio.run(fetch_page_content())

# Parse the HTML content using lxml
parser = html.fromstring(page_content)

# Extract the individual post elements
elements = parser.xpath('//article[@class="w-full m-0"]')

# Initialize a list to hold the extracted data
posts_data = []

# Iterate over each post element
for element in elements:
    title = element.xpath('@aria-label')[0]
    link = element.xpath('.//div[@class="relative truncate text-12 xs:text-14 font-semibold mb-xs "]/a/@href')[0]
    tag = element.xpath('.//span[@class="bg-tone-4 inline-block truncate max-w-full text-12 font-normal align-text-bottom text-secondary box-border px-[6px] rounded-[20px] leading-4 relative top-[-0.25rem] xs:top-[-2px] my-2xs xs:mb-sm py-0 "]/div/text()')[0].strip()

    post_info = {
        "title": title,
        "link": link,
        "tag": tag
    }

    posts_data.append(post_info)

# Save the data to a JSON file
with open('reddit_posts.json', 'w') as f:
    json.dump(posts_data, f, indent=4)

print("Data extraction complete. Saved to reddit_posts.json")
This method enables scraping across various subreddits, gathering insightful information from the rich discussions within Reddit communities. It is important to use rotating proxies to minimize the risk of detection by Reddit. Dynamic mobile and residential proxies, which have the highest trust factor online, allow data to be collected without triggering CAPTCHAs or blocks, making for a smoother scraping experience.
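To apply the same approach to other subreddits, one option is to generate the listing URLs from a list of names. This is a small sketch that follows the URL pattern used in this tutorial (`/r/<name>/top/?t=week`). `build_top_url` and the subreddit list are illustrative, and it is an assumption that the same XPaths match on every subreddit's listing page.

```python
from urllib.parse import quote, urlencode


def build_top_url(subreddit: str, period: str = "week") -> str:
    """Build a top-posts URL following the pattern used in this tutorial."""
    # quote() escapes characters that are not valid in a URL path segment
    name = quote(subreddit, safe="")
    return f"https://www.reddit.com/r/{name}/top/?{urlencode({'t': period})}"


# Illustrative list of subreddits to visit
subreddits = ["technology", "programming", "science"]
urls = [build_top_url(name) for name in subreddits]

print(urls[0])  # prints: https://www.reddit.com/r/technology/top/?t=week
```

Each URL can then be passed to a fetch function in turn, ideally with a pause between requests to stay well clear of rate limits.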