This article explains how to extract Quora data, why it can be useful, and how Python can automate the task. We'll also look at the main tools and libraries needed to scrape the site.
Quora is one of the most popular question-and-answer websites, used by people from around the world. It is especially prominent in the USA, UK, Canada, Australia, and other English-speaking countries. Analyzing Quora data brings numerous benefits: the platform can serve research, content development, and the building of AI solutions.
Web scraping is the act of extracting information from a website or web page. It is a modern way of gathering large amounts of information and converting it into an organized structure such as a CSV file.
What does scraping the Quora platform give us? Quora data scraping provides insight into user participation, the relevance and popularity of different questions, and the answers they receive.
Python libraries are used to scrape Quora data. The most relevant ones for this task are Requests for fetching pages, BeautifulSoup for parsing HTML, and Selenium (with Selenium Wire) for automating a browser. These libraries make it possible to interact with web pages and collect information from Quora with little effort.
This section walks through building the scraper step by step: prepare the workspace, install the required libraries, then write the code that sends requests, parses the HTML, and processes the retrieved information.
Before scraping, prepare the working environment:
python -m venv quora_scraper
source quora_scraper/bin/activate # For MacOS and Linux
quora_scraper\Scripts\activate # For Windows
Next, install the libraries you will need for web scraping:
pip install requests beautifulsoup4
These libraries are needed for sending HTTP requests and parsing the returned HTML.
To request a page and get its HTML back, we will use the Requests library. It is intuitive and gets the job done fast.
import requests

url = "https://www.quora.com/How-do-you-open-your-own-company"
response = requests.get(url)
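Note that Quora may reject requests that identify themselves with the default python-requests user agent. Sending browser-like headers and checking the status code helps confirm the page actually came back; a minimal sketch:

# Pretend to be a regular browser; Quora may reject the default user agent
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}
response = requests.get(url, headers=headers)
print(response.status_code)  # 200 means the HTML was returned successfully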
To scrape Quora answers with Python, we also need BeautifulSoup (installed above as part of the beautifulsoup4 package). Here is how you use it to parse the response:
from bs4 import BeautifulSoup  # the "lxml" parser also requires: pip install lxml

soup = BeautifulSoup(response.text, "lxml")
question = soup.find(class_='q-box qu-userSelect--text')
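find() returns a tag object, or None if Quora's class names have changed, so guard before reading the text:

question_text = question.text if question else None
print(question_text)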
To extract the questions and answers, you only need to locate the HTML tags that hold them. Here is an example:
text_of_answers = []
for answer_id in range(6):
    answers = soup.find_all(
        class_=f'q-box dom_annotate_question_answer_item_{answer_id} qu-borderAll qu-borderColor--raised qu-boxShadow--small qu-mb--small qu-bg--raised')
    for answer in answers:
        text = answer.find('p', class_='q-text qu-display--block qu-wordBreak--break-word qu-textAlign--start')
        if text:  # skip answers whose markup does not match
            text_of_answers.append(text.text)
Quora pages tend to hold many answers, and you often need to scroll a lot to see everything. To automate the browser for this, we will use Selenium.
Below is how to set up Selenium with a proxy to better hide your identity:
from selenium.webdriver.chrome.options import Options
from seleniumwire import webdriver as wiredriver

chrome_options = Options()
chrome_options.add_argument("--headless")

# Chrome has no --proxy-auth argument; Selenium Wire handles authenticated proxies instead
seleniumwire_options = {
    'proxy': {
        'http': f'http://{proxy_username}:{proxy_password}@{proxy_address}',
        'https': f'https://{proxy_username}:{proxy_password}@{proxy_address}',
    }
}
driver = wiredriver.Chrome(options=chrome_options, seleniumwire_options=seleniumwire_options)
After setting everything up in Selenium, open the page, then scroll down and capture the data:
url = "https://www.quora.com/How-do-you-open-your-own-company"
driver.get(url)  # load the page before scrolling
driver.execute_script('window.scrollBy(0, 1000)')
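A single scrollBy only loads one extra batch of answers. To keep scrolling until no new content appears, you can loop until the page height stops growing; a minimal sketch:

import time

last_height = driver.execute_script('return document.body.scrollHeight')
while True:
    driver.execute_script('window.scrollTo(0, document.body.scrollHeight)')
    time.sleep(2)  # give newly loaded answers time to render
    new_height = driver.execute_script('return document.body.scrollHeight')
    if new_height == last_height:
        break  # the page stopped growing, so all answers are loaded
    last_height = new_height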
When you are done collecting the data, store it in a form that is easy to analyze later. JSON is a convenient choice, since it is widely supported:
import json

with open('quora_data.json', 'w', encoding='utf-8') as json_file:
    json.dump(text_of_answers, json_file, ensure_ascii=False, indent=4)
We saved the data to JSON, but in some cases you may want more than one format at once. Here is a function that does exactly that:
def export_data(data, csv_filename="quora_data.csv", json_filename="quora_data.json"):
    save_to_csv(data, csv_filename)
    save_to_json(data, json_filename)

export_data(text_of_answers)
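The save_to_csv and save_to_json helpers are not shown above; here is a minimal sketch of what they might look like, assuming the data is a list of strings or small dicts:

import csv
import json

def save_to_csv(data, filename):
    # One row per scraped item; dicts are serialized so they fit a single column
    with open(filename, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['text'])
        for item in data:
            writer.writerow([item if isinstance(item, str) else json.dumps(item, ensure_ascii=False)])

def save_to_json(data, filename):
    with open(filename, 'w', encoding='utf-8') as f:
        json.dump(data, f, ensure_ascii=False, indent=4)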
Quora has defenses against automated data scraping, so sending many requests in a row will get your IP address blocked. There are ways around this, though.
A basic way to mimic human behavior is to add a random delay between requests:
import time
import random

def random_delay():
    # Pause for 2-5 seconds to look less like a bot
    time.sleep(random.uniform(2, 5))
Call this function after every request to reduce the chance of being blocked.
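For example, assuming a hypothetical list of question URLs called urls_to_scrape:

urls_to_scrape = []  # hypothetical list of Quora question URLs

for url in urls_to_scrape:
    response = requests.get(url, headers=headers)
    random_delay()  # pause before the next request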
Using proxy servers is an easy way to avoid having your IP address blocked. Let's see how proxies work with Requests and with Selenium.
import requests

url = "https://www.quora.com/How-do-you-open-your-own-company"
proxy = 'LOGIN:PASSWORD@ADDRESS:PORT'
proxies = {
    "http": f"http://{proxy}",
    "https": f"https://{proxy}",
}
response = requests.get(url, proxies=proxies)
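If you have more than one proxy, rotating through them spreads requests across several IP addresses. A minimal sketch, with a hypothetical proxy_list:

import random

# Hypothetical pool of proxies in LOGIN:PASSWORD@ADDRESS:PORT form
proxy_list = ['LOGIN:PASSWORD@ADDRESS1:PORT', 'LOGIN:PASSWORD@ADDRESS2:PORT']

def get_with_random_proxy(url):
    proxy = random.choice(proxy_list)  # pick a different proxy for each request
    proxies = {"http": f"http://{proxy}", "https": f"https://{proxy}"}
    return requests.get(url, proxies=proxies)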
proxy_address = ""
proxy_username = ""
proxy_password = ""

chrome_options = Options()
chrome_options.add_argument("--headless")

# Selenium Wire passes the authenticated proxy; Chrome itself has no --proxy-auth flag
seleniumwire_options = {
    'proxy': {
        'http': f'http://{proxy_username}:{proxy_password}@{proxy_address}',
        'https': f'https://{proxy_username}:{proxy_password}@{proxy_address}',
    }
}
driver = wiredriver.Chrome(options=chrome_options, seleniumwire_options=seleniumwire_options)
Now that all the steps are covered, it is time to combine them into one script.
import json
import time

from bs4 import BeautifulSoup
from selenium.webdriver.chrome.options import Options
from seleniumwire import webdriver as wiredriver

proxy_address = ""
proxy_username = ""
proxy_password = ""

chrome_options = Options()
chrome_options.add_argument("--headless")

# Selenium Wire handles the authenticated proxy; Chrome has no --proxy-auth flag
seleniumwire_options = {
    'proxy': {
        'http': f'http://{proxy_username}:{proxy_password}@{proxy_address}',
        'https': f'https://{proxy_username}:{proxy_password}@{proxy_address}',
    }
}
driver = wiredriver.Chrome(options=chrome_options, seleniumwire_options=seleniumwire_options)

url = "https://www.quora.com/How-do-you-open-your-own-company"
driver.get(url)  # load the page first, then scroll

# Scroll until the page height stops growing, so all answers are loaded
last_height = driver.execute_script('return document.body.scrollHeight')
while True:
    driver.execute_script('window.scrollTo(0, document.body.scrollHeight)')
    time.sleep(2)
    new_height = driver.execute_script('return document.body.scrollHeight')
    if new_height == last_height:
        break
    last_height = new_height

html = driver.page_source
driver.quit()

soup = BeautifulSoup(html, "lxml")
question = soup.find(class_='q-box qu-userSelect--text')

# Answers
text_of_answers = [{'question': question.text if question else None}]
for answer_id in range(6):
    answers = soup.find_all(
        class_=f'q-box dom_annotate_question_answer_item_{answer_id} qu-borderAll qu-borderColor--raised qu-boxShadow--small qu-mb--small qu-bg--raised')
    for answer in answers:
        text = answer.find('p', class_='q-text qu-display--block qu-wordBreak--break-word qu-textAlign--start')
        if text:  # skip answers whose markup does not match
            text_of_answers.append({'answers': text.text})

with open('Quora_data.json', 'w', encoding='utf-8') as json_file:
    json.dump(text_of_answers, json_file, ensure_ascii=False, indent=4)

print("Saved data to Quora_data.json")
In this article we have discussed methods of scraping Quora with Python. The scripts we constructed let you work around some restrictions, handle pagination, and save data in several formats such as JSON and CSV.
For large-scale data collection, the most effective method would be using Quora's API. If you instead choose BeautifulSoup or Selenium, using proxy servers is a must for sustained performance.