How to Scrape Quora Questions and Answers Using Python


This article explains how to extract Quora data, why it can be useful, and how Python can automate the task. We also cover the main tools and libraries needed to scrape the site.

Why Scrape Quora Data?

Quora is one of the most popular question-and-answer websites, used by people from around the world. It is especially prominent in the USA, the UK, Canada, Australia, and other English-speaking countries. Analyzing Quora data brings numerous benefits, such as:

  • Up-to-date information on relevant topics, expert analysis, and trending questions across various subjects.
  • Successful interaction with target audiences by addressing their questions and leveraging that trust to build a business or personal expert reputation.
  • Quality answers and shared experiences without performing exhaustive searches.

The platform can be used for research, content development, or building AI solutions.

Understanding Quora Web Scraping

Web scraping refers to extracting information and data from a website or web page. It is a modern way of gathering large amounts of information and converting it into an organized structure such as a CSV file.

What web scraping the Quora platform gives us:

  • the question itself;
  • links to related Quora pages;
  • the number of answers a question has received so far;
  • the authors of those answers;
  • the dates the answers were published.

Thus, Quora data scraping gives insight into user participation and the popularity of different questions, as well as the answers they receive.

Essential Tools for Scraping Quora

To scrape Quora data, Python libraries are used. Below are the most relevant ones for the task:

  • BeautifulSoup — for parsing the HTML pages and collecting the sought information.
  • Requests — for sending HTTP requests and retrieving the web pages.
  • Selenium — for browser automation, as well as for the capturing of dynamically-generated content.

These libraries make it possible to interact with web pages and collect information from Quora effortlessly.

Building a Quora Scraping Python Script

This section walks through building the scraper step by step: prepare the workspace, install the required libraries, and create routines for sending requests, parsing the HTML, and handling the retrieved data.

Step 1: Setting Up Your Python Project Environment

Before scraping, prepare the working environment:

  1. Download and install Python from the official page.
  2. Set up a virtual environment to compartmentalize your libraries in use.
    
    python -m venv quora_scraper
    source quora_scraper/bin/activate  # For MacOS and Linux
    quora_scraper\Scripts\activate     # For Windows
    
    

Step 2: Installing Required Python Libraries

In this step, install the libraries you will need for web scraping:


pip install requests beautifulsoup4

These libraries handle sending HTTP requests and parsing the returned HTML.

Step 3: Sending Requests to Quora

To fetch a page and get its HTML back, we will use the Requests library. It is intuitive and gets the job done fast.


url = "https://www.quora.com/How-do-you-open-your-own-company"
response = requests.get(url)
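
Quora is quick to reject requests that use the Requests library's default user agent, so it helps to send browser-like headers. A minimal sketch; the header values below are examples, not requirements:

```python
def build_headers(user_agent=None):
    """Build browser-like HTTP headers for scraping requests.

    The default User-Agent string is just an example of a browser UA.
    """
    return {
        "User-Agent": user_agent or (
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
            "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36"
        ),
        "Accept-Language": "en-US,en;q=0.9",
    }
```

Pass the result along with the request, e.g. `requests.get(url, headers=build_headers())`.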

Step 4: Parsing the HTML Response

To scrape Quora answers with Python, we need another library called BeautifulSoup. This is how you use it to parse the page we just downloaded:


from bs4 import BeautifulSoup  # the "lxml" parser also requires: pip install lxml

soup = BeautifulSoup(response.text, "lxml")
question = soup.find(class_='q-box qu-userSelect--text')

Step 5: Extracting Quora Questions and Answers

To extract the questions and answers, find the HTML elements that contain them. Note that Quora's class names are auto-generated and may change over time, so the selectors below may need updating. For example:


text_of_answers = []
for answer_id in range(6):
    answers = soup.find_all(
        class_=f'q-box dom_annotate_question_answer_item_{answer_id} qu-borderAll qu-borderColor--raised qu-boxShadow--small qu-mb--small qu-bg--raised')

    for answer in answers:
        text = answer.find('p', class_='q-text qu-display--block qu-wordBreak--break-word qu-textAlign--start')
        if text is not None:
            text_of_answers.append(text.text)  # store the answer text, not the tag

Step 6: Handling Pagination for Multiple Pages

Quora questions often have many answers, and the page must be scrolled repeatedly to see all of them. For this we will use Selenium to automate a browser.

Below is how you set up Selenium and use proxies to hide your identity better:


from selenium.webdriver.chrome.options import Options
from seleniumwire import webdriver as wiredriver  # pip install selenium-wire

# Proxy credentials; see Step 9
proxy_address = ""
proxy_username = ""
proxy_password = ""

chrome_options = Options()
chrome_options.add_argument("--headless")

# Chrome has no flag for proxy credentials, so selenium-wire passes them
seleniumwire_options = {
    "proxy": {
        "http": f"http://{proxy_username}:{proxy_password}@{proxy_address}",
        "https": f"http://{proxy_username}:{proxy_password}@{proxy_address}",
    }
}

driver = wiredriver.Chrome(options=chrome_options, seleniumwire_options=seleniumwire_options)

After Selenium is set up, load the page, then scroll down to trigger loading of additional answers:


url = "https://www.quora.com/How-do-you-open-your-own-company"
driver.execute_script('window.scrollBy(0, 1000)')

driver.get(url)
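
A single scroll rarely loads every answer. Here is a sketch of a loop that keeps scrolling until the page height stops growing; the function works with any Selenium-style driver that exposes `execute_script()`, and the timing values are assumptions you should tune for your connection:

```python
import time

def scroll_to_bottom(driver, pause=1.5, max_scrolls=30):
    """Scroll down repeatedly until the page height stops changing."""
    last_height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_scrolls):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
        time.sleep(pause)  # give dynamically loaded answers time to appear
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:  # no new content was loaded
            break
        last_height = new_height
```

Call `scroll_to_bottom(driver)` after `driver.get(url)` and before reading `driver.page_source`.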

Step 7: Storing the Scraped Data Efficiently

Once you are done collecting the data, store it in a way that makes it easy to analyze later. A good choice is JSON, as it is one of the most widely used formats.


import json

with open('quora_data.json', 'w', encoding='utf-8') as json_file:
    json.dump(text_of_answers, json_file, ensure_ascii=False, indent=4)

Step 8: Exporting Data to CSV or JSON Format

We saved the data to JSON, but in some cases you may want more than one format at once. Here is a function that does exactly that, delegating to two helper functions:


def export_data(data, csv_filename="quora_data.csv", json_filename="quora_data.json"):
    save_to_csv(data, csv_filename)
    save_to_json(data, json_filename)

export_data(text_of_answers)
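
The `save_to_csv` and `save_to_json` helpers are not defined above; here is a minimal sketch, assuming the scraped data is a list of dicts like the one built in Step 10:

```python
import csv
import json

def save_to_json(data, filename):
    """Write the scraped records to a JSON file."""
    with open(filename, 'w', encoding='utf-8') as f:
        json.dump(data, f, ensure_ascii=False, indent=4)

def save_to_csv(data, filename):
    """Write a list of dicts to CSV, using the union of keys as columns."""
    fieldnames = []
    for row in data:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)
    with open(filename, 'w', encoding='utf-8', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(data)
```

Records that lack a column simply get an empty cell, which suits the mixed question/answer dicts produced by the scraper.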

Step 9: Handling Rate Limits and Blocking Issues

Quora has defenses against automated data scraping, so sending many requests in a row can get your IP address blocked. There are ways around this, though:

  1. Add a delay before sending the next request.

    This is a basic way to mimic human behavior by adding a lag between requests.

    
    import time
    import random
    
    def random_delay():
        time.sleep(random.uniform(2, 5))
    
    

    Call this function after every request to improve your chances of not being blocked.

  2. Use proxy servers.

    Proxy servers make it easy to avoid having your IP address blocked. Let’s see how they can be used with Requests and Selenium.

    • requests:
      
      url = "https://www.quora.com/How-do-you-open-your-own-company" 
      
      
      
      proxy = 'LOGIN:PASSWORD@ADDRESS:PORT'
      proxies = {
         "http": f"http://{proxy}",
         "https": f"https://{proxy}",
      }
      
      response = requests.get(url, proxies=proxies)
      
      
    • Selenium:
      
      proxy_address = ""
      proxy_username = ""
      proxy_password = ""
      
      chrome_options = Options()
      chrome_options.add_argument(f'--proxy-server={proxy_address}')
      chrome_options.add_argument(f'--proxy-auth={proxy_username}:{proxy_password}')
      
      chrome_options.add_argument("--headless")
      
      driver = wiredriver.Chrome(options=chrome_options)
      
      
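The two defenses above can be combined: rotate through a pool of proxies and pause randomly between requests. A sketch, where `fetch(url, proxy)` stands in for the real download call (e.g. `requests.get` with a `proxies` dict built from the proxy string):

```python
import itertools
import random
import time

def scrape_with_rotation(urls, proxies, fetch, min_delay=2.0, max_delay=5.0):
    """Fetch each URL through the next proxy in the pool, pausing between requests.

    `fetch(url, proxy)` is a placeholder for the real download call.
    """
    proxy_cycle = itertools.cycle(proxies)  # loop over the pool indefinitely
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(random.uniform(min_delay, max_delay))  # human-like pause
        results.append(fetch(url, next(proxy_cycle)))
    return results
```

With a handful of proxies, each one sees only a fraction of the traffic, which lowers the chance of any single address being blocked.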

Step 10: Putting Everything Together

After covering all the steps, it is time to combine them into one script.


import json

from bs4 import BeautifulSoup
from selenium.webdriver.chrome.options import Options
from seleniumwire import webdriver as wiredriver

proxy_address = ""
proxy_username = ""
proxy_password = ""

chrome_options = Options()
chrome_options.add_argument("--headless")

# Chrome has no flag for proxy credentials, so selenium-wire passes them
seleniumwire_options = {
    "proxy": {
        "http": f"http://{proxy_username}:{proxy_password}@{proxy_address}",
        "https": f"http://{proxy_username}:{proxy_password}@{proxy_address}",
    }
}

driver = wiredriver.Chrome(options=chrome_options, seleniumwire_options=seleniumwire_options)

url = "https://www.quora.com/How-do-you-open-your-own-company"
driver.get(url)

# Scroll down to load additional answers
driver.execute_script('window.scrollBy(0, 1000)')

html = driver.page_source
driver.quit()

soup = BeautifulSoup(html, "lxml")

# Question
question = soup.find(class_='q-box qu-userSelect--text')
text_of_answers = [{'question': question.text if question else None}]

# Answers
for answer_id in range(6):
    answers = soup.find_all(
        class_=f'q-box dom_annotate_question_answer_item_{answer_id} qu-borderAll qu-borderColor--raised qu-boxShadow--small qu-mb--small qu-bg--raised')

    for answer in answers:
        text = answer.find('p', class_='q-text qu-display--block qu-wordBreak--break-word qu-textAlign--start')
        if text is not None:
            text_of_answers.append({'answers': text.text})

with open('quora_data.json', 'w', encoding='utf-8') as json_file:
    json.dump(text_of_answers, json_file, ensure_ascii=False, indent=4)
    print("Saved quora_data.json")

Conclusion

In this article we discussed methods of scraping Quora with Python. The resulting script lets users work around some restrictions, handle pagination, and save data in formats such as JSON and CSV.

For large-scale data collection, using Quora’s API would be the most effective method. If you instead choose BeautifulSoup or Selenium, then using proxy servers is a must for sustained performance.
