Harvesting data from a website involves far more than simply downloading its content. To get past rate limits, missing data, and other sophisticated blocks, you need a set of additional techniques, and Python data scraping provides them.
In this article we define what Python scraping is, explain why Python is the right tool for the job, and outline tactics that make full use of its data scraping capabilities. Together, these will help you retrieve information even from the most heavily protected sites.
Python is particularly well suited to harvesting data from websites. Beyond its ease of use, its libraries such as Scrapy, Selenium, and BeautifulSoup are remarkably powerful, and an active community keeps developing new scripts and supporting newcomers. That is why Python is the language of choice for web scraping today. Let's walk through the main strategies available at the moment.
This section shows how to scrape complex websites using the more sophisticated techniques available in Python. You will learn how to:
bypass CAPTCHA protection;
simulate the behavior of a real user;
avoid honeypot traps;
manage cookies and sessions;
handle errors with retries and exponential backoff;
scrape dynamic, JavaScript-driven content;
work around TLS fingerprinting;
use public and hidden API endpoints.
These approaches help make Python data scraping effective while minimizing the chances of being blocked or denied access by the server.
Now let's move on to the tactics for scraping in Python effectively.
Many websites deploy CAPTCHA systems as a first line of defense to keep their data from being scraped with Python. Such systems can be beaten in several ways: by using automatic recognition services such as 2Captcha or Anti-Captcha, or by applying machine learning to recognize the images. Another option is to reduce the request rate to a level the site does not associate with automated data collection.
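As an illustration of the service-based approach, here is a minimal sketch assuming the 2captcha-python package; the API key, page URL, and site key are placeholders, not values from this article:

from twocaptcha import TwoCaptcha

# Placeholder API key for the 2Captcha service
solver = TwoCaptcha('YOUR_2CAPTCHA_API_KEY')

# Solve a reCAPTCHA found on a hypothetical target page
result = solver.recaptcha(
    sitekey='TARGET_SITE_KEY',       # the site key embedded in the page
    url='https://example.com/login'  # the page protected by the CAPTCHA
)

# The returned token is then submitted along with the form data
print(result['code'])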
To make requests look less hostile, the scraper should behave as closely as possible to a normal user: introduce random delays between actions, rotate the User-Agent, scroll the page, move the mouse pointer, simulate typing, and so on. Using Selenium or Playwright as Python scraping tools gives the automation much more human-like behavior, which helps avoid blocks.
import random
import requests

url = 'https://google.com'

# A pool of realistic User-Agent strings to rotate between requests
user_agents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
]

# Headers copied from a real browser session
headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
    'accept-language': 'en-IN,en;q=0.9',
    'dnt': '1',
    'sec-ch-ua': '"Not/A)Brand";v="8", "Chromium";v="126", "Google Chrome";v="126"',
    'sec-ch-ua-mobile': '?0',
    'sec-ch-ua-platform': '"Linux"',
    'sec-fetch-dest': 'document',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-site': 'none',
    'sec-fetch-user': '?1',
    'upgrade-insecure-requests': '1',
}

# Pick a random User-Agent for this particular request
headers['user-agent'] = random.choice(user_agents)

response = requests.get(url=url, headers=headers)
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.action_chains import ActionChains
options = webdriver.ChromeOptions()
options.add_argument("--headless") # Running the browser without a graphical interface
driver = webdriver.Chrome(options=options)
driver.get("https://google.com")
# Find an element by XPath
element = driver.find_element(By.XPATH, "//button[text()='Confirm']")
# Use ActionChains to move the cursor
actions = ActionChains(driver)
actions.move_to_element(element).perform()
# Close the browser
driver.quit()
Some websites embed elements that are invisible to regular users but that a bot may accidentally trigger. These honeypots include hidden forms and links: clicking or submitting them gets the bot's access barred. Before collecting data, check CSS styles and attributes such as display: none or opacity: 0 and avoid interacting with elements that carry them.
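A minimal sketch of this check with BeautifulSoup, covering only inline styles (the HTML snippet is hypothetical; real pages may also hide elements via external stylesheets):

from bs4 import BeautifulSoup

html = '<a href="/page">Visible</a><a href="/trap" style="display: none">Hidden</a>'
soup = BeautifulSoup(html, 'html.parser')

for link in soup.find_all('a'):
    style = (link.get('style') or '').replace(' ', '').lower()
    # Skip elements hidden via inline CSS - likely honeypots
    if 'display:none' in style or 'opacity:0' in style:
        continue
    print(link['href'])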
If requests are made without the right cookies or session setup, some sites block repeated requests that look too simplistic. To work around this, use requests.Session(), reuse saved cookies, and behave like a genuine user. The User-Agent header also needs to be rotated, since bots are easily recognized by it.
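A minimal sketch of session reuse with requests (the URLs are placeholders):

import requests

session = requests.Session()
session.headers.update({
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
})

# The first request stores any cookies the site sets
session.get('https://google.com')

# Subsequent requests reuse those cookies automatically
response = session.get('https://google.com/search?q=python')
print(response.status_code)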
If the server fails to respond or temporarily returns an error, pause before retrying instead of hammering it with repeated requests. Exponential backoff is preferable: the waiting time grows after each unsuccessful attempt, for instance 1 second, then 2 seconds, then 4 seconds, and so on. This reduces the chance of being blocked, respects the website's limits, and lowers the load your Python data scraping creates.
import time
import requests

def fetch_with_backoff(url, max_retries=5):
    retries = 0
    wait_time = 1  # initial 1-second delay

    while retries < max_retries:
        try:
            response = requests.get(url)
            # If the request is successful, return the result
            if response.status_code == 200:
                return response.text
            print(f"Error {response.status_code}. Retrying in {wait_time} sec.")
        except requests.exceptions.RequestException as e:
            print(f"Connection error: {e}. Retrying in {wait_time} sec.")

        # Wait before retrying
        time.sleep(wait_time)
        # Double the delay for the next attempt
        wait_time *= 2
        retries += 1

    return None

url = "https://google.com"
html = fetch_with_backoff(url)
Some websites load content in stages or only render it after user input. In such cases libraries like BeautifulSoup alone are unlikely to help; web scraping with Selenium, Puppeteer, or Playwright will. These tools open pages the way a normal user would, so they can click buttons, type text, and otherwise interact with elements on the page.
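A minimal Selenium sketch that waits for dynamically loaded content (the element id "content" is a hypothetical placeholder):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

options = webdriver.ChromeOptions()
options.add_argument("--headless")
driver = webdriver.Chrome(options=options)

driver.get("https://google.com")

# Wait up to 10 seconds for the dynamically rendered block to appear
element = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, "content"))
)
print(element.text)

driver.quit()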
Some webpages use JavaScript to load data only after the user opens the page, so a standard HTTP request will not fetch all of the necessary information. Selenium can be used to gather such data, or the network requests can be inspected in the browser's DevTools. The latter helps uncover hidden API endpoints, which can then be queried directly with minimal hassle.
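A minimal sketch of calling such an endpoint once it has been spotted in the Network tab (the URL and parameters are hypothetical):

import requests

# Hypothetical JSON endpoint discovered in the DevTools Network tab
api_url = "https://example.com/api/v1/products?page=1"

headers = {
    "accept": "application/json",
    "user-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
}

response = requests.get(api_url, headers=headers)
if response.ok:
    data = response.json()  # structured data, no HTML parsing required
    print(data)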
Many websites analyze incoming requests on the server side to spot automation. In particular, some are known to verify TLS fingerprints to distinguish automated requests from real browsers: the server examines connection attributes such as the TLS/SSL version, the cipher suites, and other handshake details. This can be mitigated by varying connection attributes, using custom headers, and routing traffic through proxies.
import requests

url = 'https://google.com'
# Proxy address, optionally with credentials, e.g. 'username:password@your-proxy'
proxy = 'username:password@your-proxy'

proxies = {
    "http": f"http://{proxy}",
    "https": f"http://{proxy}",
}

response = requests.get(url=url, proxies=proxies)
If a website offers a public API, use it rather than resorting to scraping: it is faster, more reliable, and far less likely to get you blocked. A good starting point for finding an API endpoint is to check the requests the website makes, which are visible in DevTools. If there is no API, you will have to work with the HTML code.
Websites can change their markup at any time, which can break a scraper. To counter this, rely on stable selectors such as IDs and data attributes rather than brittle positional paths, keep the parsing logic in one place so it is easy to update, and monitor the scraper so that parsing failures are noticed quickly; a small illustration of the selector choice follows.
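A minimal sketch with BeautifulSoup, using a hypothetical data-price attribute, comparing a selector that survives layout changes with one that does not:

from bs4 import BeautifulSoup

html = '<div class="col"><div><span data-price="10.99">$10.99</span></div></div>'
soup = BeautifulSoup(html, 'html.parser')

# Tied to a semantic attribute: keeps working if the wrappers around it change
price = soup.select_one('[data-price]')

# Tied to the page layout: breaks as soon as a wrapper div is added or removed
fragile = soup.select_one('div.col > div > span')

if price is not None:
    print(price['data-price'])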
In some cases, web scraping with Python can breach a site's terms of use or even be considered illegal in certain jurisdictions. Before scraping, examine robots.txt and the terms of service along with the site's policies. It is also best to use a public API when one is available, and to cap the number of requests so the load on the server stays low.
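Checking robots.txt can be automated with the Python standard library; a minimal sketch (the bot name and paths are placeholders):

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://google.com/robots.txt")
rp.read()

# Check whether our (hypothetical) bot is allowed to fetch a given path
if rp.can_fetch("MyScraperBot", "https://google.com/search"):
    print("Allowed to scrape this path")
else:
    print("Disallowed by robots.txt")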
Advanced web scraping with Python has clear advantages, but doing it the right way matters just as much. We have covered the key aspects of the process: bypassing CAPTCHA, simulating user actions, managing cookies and sessions, dealing with honeypots, and extracting data from asynchronous web applications.
Also keep in mind the ethical side and the terms of the site you are working with. Use API endpoints when they are available, and if HTML parsing is unavoidable, follow the guidelines above to reduce the chances of being blocked or running into legal trouble.
By applying the methods from this Python web scraping tutorial, you can greatly reduce the risks while maximizing efficiency.