How to scrape Twitter data using a Python script

A Python script for scraping Twitter data is useful for gathering insights, such as user reviews or discussions around specific topics, which can aid marketing and research. Automating collection with such a script makes the process faster and more efficient.

Step 1: Installations and imports

There are three packages you must install before you begin writing the actual code. You also need pip, the package manager for Python, to install them. Recent Python installers include pip by default, so in most cases it is already available once Python is installed. To install the packages, run the command below in your command-line interface (CLI).

pip install selenium-wire selenium undetected-chromedriver

Once the installation is complete, you must import these packages into your Python file as shown below.

from seleniumwire import webdriver as wiredriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.action_chains import ActionChains
from time import sleep
import json
import seleniumwire.undetected_chromedriver as uc  # Selenium Wire's wrapper, which accepts seleniumwire_options
import random
import ssl

  • Selenium Wire: extends Selenium with the ability to configure proxies directly, which is crucial for avoiding blocks during scraping.
  • Selenium: drives the browser. ActionChains and Keys simulate user actions, By locates elements, and WebDriverWait with expected_conditions waits until a condition is met.
  • Undetected Chromedriver: a patched ChromeDriver that works with Selenium Wire to circumvent bot detection on websites, reducing the risk of being blocked.
  • time, random, json, ssl: standard Python libraries for timing operations, picking a random proxy, handling JSON data, and configuring HTTPS contexts.

Step 2: Proxy initialization

Using a proxy during scraping is important. Twitter is one of the social media platforms that frowns on data scraping, so to stay safe and avoid a ban, you should use a proxy.

All you have to do is provide your proxy address, username, and password; your IP address is then masked. Running a headless browser, that is, a browser without a graphical interface, helps speed up the scraping process, which is why the headless flag is added to the options.

# Specify the proxy server address with username and password in a List of proxies
proxies = [
    "proxy_username:proxy_password@proxy_address:port_number",
]




# function to get a random proxy
def get_proxy():
    return random.choice(proxies)


# Set up Chrome options with the proxy and authentication
chrome_options = Options()
chrome_options.add_argument("--headless")


proxy = get_proxy()
proxy_options = {
    "proxy": {
        "http": f"http://{proxy}",
        "https": f"https://{proxy}",
    }
}

Step 3: How to log In to X/Twitter

To effectively scrape Twitter data using Python, the script requires access credentials for the Twitter account, including the username and password.
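Rather than typing the credentials directly into the script, one option is to read them from environment variables. The sketch below is a minimal illustration and not part of the original script; the variable names X_USERNAME and X_PASSWORD and the helper load_credentials() are hypothetical.

```python
import os


def load_credentials(env=os.environ):
    """Return (username, password) read from environment variables.

    X_USERNAME and X_PASSWORD are hypothetical names chosen for this sketch.
    """
    username = env.get("X_USERNAME", "")
    password = env.get("X_PASSWORD", "")
    if not username or not password:
        raise RuntimeError("Set X_USERNAME and X_PASSWORD before running the script.")
    return username, password
```

With this helper, the two assignments in the script could become x_username, x_password = load_credentials().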

Additionally, you must specify a search keyword. The script inserts it into the template https://twitter.com/search?q={search_keyword}&src=typed_query&f=top to construct a URL that searches for this keyword on Twitter.
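The URL construction can be sketched with urllib.parse.quote, which percent-encodes spaces and also characters such as # and & that a plain replace(' ', '%20') would leave untouched. This is a minimal sketch; the helper name build_search_url() is an assumption for illustration and is not part of the original script.

```python
from urllib.parse import quote


def build_search_url(keyword: str) -> str:
    """Build the search URL, percent-encoding the keyword."""
    # safe="" makes quote() encode every reserved character, including "/"
    encoded = quote(keyword, safe="")
    return f"https://twitter.com/search?q={encoded}&src=typed_query&f=top"


print(build_search_url("kfc rice #deal"))
# https://twitter.com/search?q=kfc%20rice%20%23deal&src=typed_query&f=top
```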

The next step involves creating an instance of ChromeDriver, incorporating proxy details as an option. This setup directs ChromeDriver to use a specific IP address when loading the page. Following this setup, the search URL is loaded with these configurations. Once the page is loaded, you must log in to access the search results. Using WebDriverWait, the script verifies that the page is fully loaded by checking for the presence of the username entry area. If this area fails to load, it is advisable to terminate the ChromeDriver instance.

search_keyword = input("What topic on X/Twitter would you like to gather data on?\n").replace(' ', '%20')
constructed_url = f"https://twitter.com/search?q={search_keyword}&src=typed_query&f=top"


# provide your X/Twitter username and password here
x_username = "" 
x_password = ""


print(f'Opening {constructed_url} in Chrome...')


# Create a WebDriver instance with undetected chrome driver
driver = uc.Chrome(options=chrome_options, seleniumwire_options=proxy_options)


driver.get(constructed_url)


element = None  # stays None if the login form never appears
try:
    element = WebDriverWait(driver, 20).until(
        EC.presence_of_element_located((By.XPATH, "//div[@class='css-175oi2r r-1mmae3n r-1e084wi r-13qz1uu']"))
    )
except Exception as err:
    print(f'WebDriver Wait Error: Most likely Network TimeOut: Details\n{err}')
    driver.quit()


# Sign In
if element:
    username_field = driver.find_element(By.XPATH, "//input[@class='r-30o5oe r-1dz5y72 r-13qz1uu r-1niwhzg r-17gur6a r-1yadl64 r-deolkf r-homxoj r-poiln3 r-7cikom r-1ny4l3l r-t60dpp r-fdjqy7']")
    username_field.send_keys(x_username)
    username_field.send_keys(Keys.ENTER)


    password_field = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.XPATH, "//input[@class='r-30o5oe r-1dz5y72 r-13qz1uu r-1niwhzg r-17gur6a r-1yadl64 r-deolkf r-homxoj r-poiln3 r-7cikom r-1ny4l3l r-t60dpp r-fdjqy7']"))
    )
    password_field.send_keys(x_password)
    password_field.send_keys(Keys.ENTER)


    print("Sign In Successful...\n")


    sleep(10)

Step 4: Extracting the top results

Create a list variable, results, to store all the collected data as dictionaries. Then define a function named scrape() that collects the details of each tweet: the display name, username, post content, and metrics such as likes and impressions.

The lists returned by the individual queries can differ in length. To keep them aligned, min() finds the length of the shortest list, and only that many entries are read from each.
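The alignment step can be illustrated with plain lists. The sketch below uses made-up values in place of scraped elements and shows that truncating to the shortest length gives every row the same set of fields; zip() performs the same truncation implicitly. Note that truncation only equalizes lengths: if an item is missing from the middle of one list, later entries can still be paired with the wrong tweet.

```python
# Hypothetical stand-ins for the text of scraped elements
usernames = ["@a", "@b", "@c"]
posts = ["first post", "second post", "third post"]
likes = ["10", "2.5K"]  # one entry short

# Explicit truncation, as in scrape()
min_length = min(len(usernames), len(posts), len(likes))
rows = [
    {"Username": usernames[i], "Post": posts[i], "Likes": likes[i]}
    for i in range(min_length)
]

# zip() stops at the shortest input, so it yields the same rows
rows_zip = [
    {"Username": u, "Post": p, "Likes": n}
    for u, p, n in zip(usernames, posts, likes)
]

print(len(rows), rows == rows_zip)  # 2 True
```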

The metrics are scraped as strings such as "1.5K", not as numbers. The convert_to_numeric() function converts them to integers so the results can be sorted by impressions.

results = []


# Scrape
def scrape():
   display_names = driver.find_elements(By.XPATH,
                                        '//*[@class="css-175oi2r r-1wbh5a2 r-dnmrzs r-1ny4l3l r-1awozwy r-18u37iz"]/div[1]/div/a/div/div[1]/span/span')
   usernames = driver.find_elements(By.XPATH,
                                    '//*[@class="css-175oi2r r-1wbh5a2 r-dnmrzs r-1ny4l3l r-1awozwy r-18u37iz"]/div[2]/div/div[1]/a/div/span')
   posts = driver.find_elements(By.XPATH,
                                '//*[@class="css-146c3p1 r-8akbws r-krxsd3 r-dnmrzs r-1udh08x r-bcqeeo r-1ttztb7 r-qvutc0 r-37j5jr r-a023e6 r-rjixqe r-16dba41 r-bnwqim"]/span')
   comments = driver.find_elements(By.XPATH,
                                   '//*[@class="css-175oi2r r-1kbdv8c r-18u37iz r-1wtj0ep r-1ye8kvj r-1s2bzr4"]/div[1]/button/div/div[2]/span/span/span')
   retweets = driver.find_elements(By.XPATH,
                                   '//*[@class="css-175oi2r r-1kbdv8c r-18u37iz r-1wtj0ep r-1ye8kvj r-1s2bzr4"]/div[2]/button/div/div[2]/span/span/span')
   likes = driver.find_elements(By.XPATH,
                                '//*[@class="css-175oi2r r-1kbdv8c r-18u37iz r-1wtj0ep r-1ye8kvj r-1s2bzr4"]/div[3]/button/div/div[2]/span/span/span')
   impressions = driver.find_elements(By.XPATH,
                                      '//*[@class="css-175oi2r r-1kbdv8c r-18u37iz r-1wtj0ep r-1ye8kvj r-1s2bzr4"]/div[4]/a/div/div[2]/span/span/span')

   min_length = min(len(display_names), len(usernames), len(posts), len(comments), len(retweets), len(likes),
                    len(impressions))

   for each in range(min_length):
       results.append({
           'Username': usernames[each].text,
           'displayName': display_names[each].text,
           # removesuffix (Python 3.9+) drops the exact label; rstrip would strip a set of characters
           'Post': posts[each].text.removesuffix("Show more"),
           'Comments': 0 if comments[each].text == "" else convert_to_numeric(comments[each].text),
           'Retweets': 0 if retweets[each].text == "" else convert_to_numeric(retweets[each].text),
           'Likes': 0 if likes[each].text == "" else convert_to_numeric(likes[each].text),
           'Impressions': 0 if impressions[each].text == "" else convert_to_numeric(impressions[each].text)
       })


def convert_to_numeric(value):
   multipliers = {'K': 10 ** 3, 'M': 10 ** 6, 'B': 10 ** 9}

   try:
       if value[-1] in multipliers:
           return int(float(value[:-1]) * multipliers[value[-1]])
       else:
           return int(value)
   except ValueError:
       # Handle the case where the conversion fails
       return None
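As a variation, the sketch below also accepts thousands separators such as "1,234" and lowercase suffixes, and returns 0 for empty text. It is a minimal sketch under the assumption that counts appear in forms like "105", "1,234", "1.5K", "66K", or "2M"; the name parse_count() is hypothetical and not part of the original script.

```python
MULTIPLIERS = {"K": 10 ** 3, "M": 10 ** 6, "B": 10 ** 9}


def parse_count(text: str):
    """Convert a displayed count such as '1.5K' to an int, or None if unparseable."""
    value = text.strip().replace(",", "").upper()
    if not value:
        return 0  # empty text is treated as zero, as scrape() does
    try:
        if value[-1] in MULTIPLIERS:
            # round() avoids losing a unit to floating-point truncation
            return round(float(value[:-1]) * MULTIPLIERS[value[-1]])
        return int(value)
    except ValueError:
        return None
```

Returning None for unparseable text keeps the behavior of convert_to_numeric(), so any code that sorts on the result still has to handle None.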

Step 5: Organizing the data

To better organize the data, we created a function that sorts the tweets in descending order of impressions, so the tweet with the highest count appears first.

def reorder_json_by_impressions(json_data):
    # Sort the list in place by 'Impressions', highest first;
    # a count that could not be parsed (None) is treated as 0
    json_data.sort(key=lambda x: int(x['Impressions'] or 0), reverse=True)

Write to a JSON file

A JSON file is a convenient way to store and inspect all the data collected. Writing to a JSON file is just like writing to any other file in Python; the only difference is that the json module formats the data before it is written.

If the code runs correctly, a result.json file appears in the working directory, containing results like those shown under Final results.

def organize_write_data(data: list):
    # Serialize the results; encoding to ASCII with "ignore" drops non-ASCII characters such as emoji
    output = json.dumps(data, indent=2, ensure_ascii=False).encode("ascii", "ignore").decode("utf-8")
    try:
        with open("result.json", 'w', encoding='utf-8') as file:
            file.write(output)
    except Exception as err:
        print(f"Error encountered: {err}")
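The function above discards non-ASCII characters, so emoji and accented letters vanish from the output. If keeping them is preferable, one alternative is to let json.dump write UTF-8 directly. This is a minimal sketch; the function name write_results() and its path parameter are assumptions for illustration.

```python
import json


def write_results(data: list, path: str = "result.json") -> None:
    """Write the scraped records to a UTF-8 JSON file, keeping non-ASCII text."""
    with open(path, "w", encoding="utf-8") as file:
        # ensure_ascii=False writes characters such as "é" as-is instead of \u escapes
        json.dump(data, file, indent=2, ensure_ascii=False)
```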

Pagination

To run the script, we call our functions in sequence. We create an ActionChains object from Selenium, which is used here to simulate scrolling down the page.

A loop then runs five times. In each iteration, the page is scrolled to the bottom, the script pauses for five seconds so new content can load, and the currently loaded tweets are scraped.

You can increase or decrease the loop's range to control the volume of data scraped. Note that if there is no additional content to display, the script will keep scraping the same data, producing duplicate records. To prevent this, adjust the loop range accordingly.

actions = ActionChains(driver)
for i in range(5):
    actions.send_keys(Keys.END).perform()
    sleep(5)
    scrape()


reorder_json_by_impressions(results)
organize_write_data(results)


print(f"Scraping Information on {search_keyword} is done.")


driver.quit()
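One way to guard against recording the same tweet more than once is to track which records have already been seen. The sketch below is a minimal illustration that treats the pair (Username, Post) as a tweet's identity, which is an assumption; the helper name add_unique() is hypothetical and not part of the original script.

```python
def add_unique(results: list, seen: set, record: dict) -> bool:
    """Append record to results unless an identical tweet was already stored."""
    key = (record["Username"], record["Post"])
    if key in seen:
        return False
    seen.add(key)
    results.append(record)
    return True


results, seen = [], set()
batch = [
    {"Username": "@a", "Post": "hello"},
    {"Username": "@a", "Post": "hello"},  # duplicate from a repeated scrape
    {"Username": "@b", "Post": "hello"},
]
added = [add_unique(results, seen, r) for r in batch]
print(added, len(results))  # [True, False, True] 2
```

Inside scrape(), calling add_unique() in place of results.append() would apply the check, and the loop could stop early once a scroll adds no new records.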

Full code

from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.action_chains import ActionChains
from time import sleep
import json
import seleniumwire.undetected_chromedriver as uc  # Selenium Wire's wrapper, which accepts seleniumwire_options
import random
import ssl

ssl._create_default_https_context = ssl._create_stdlib_context


search_keyword = input("What topic on X/Twitter would you like to gather data on?\n").replace(' ', '%20')
constructed_url = f"https://twitter.com/search?q={search_keyword}&src=typed_query&f=top"

# provide your X/Twitter username and password here
x_username = ""
x_password = ""

print(f'Opening {constructed_url} in Chrome...')

# Specify the proxy server address with username and password in a List of proxies
proxies = [
   "USERNAME:PASSWORD@IP:PORT",
]


# function to get a random proxy
def get_proxy():
   return random.choice(proxies)


# Set up Chrome options with the proxy and authentication
chrome_options = Options()
chrome_options.add_argument("--headless")
chrome_options.add_argument('--ignore-certificate-errors')
chrome_options.add_argument('--ignore-ssl-errors')

proxy = get_proxy()
proxy_options = {
   "proxy": {
       "http": f"http://{proxy}",
       "https": f"https://{proxy}",
   }
}

# Create a WebDriver instance with undetected chrome driver
driver = uc.Chrome(options=chrome_options, seleniumwire_options=proxy_options)

driver.get(constructed_url)

element = None  # stays None if the login form never appears
try:
   element = WebDriverWait(driver, 20).until(
       EC.presence_of_element_located((By.XPATH, "//div[@class='css-175oi2r r-1mmae3n r-1e084wi r-13qz1uu']"))
   )
except Exception as err:
   print(f'WebDriver Wait Error: Most likely Network TimeOut: Details\n{err}')
   driver.quit()

# Sign In
if element:
   username_field = driver.find_element(By.XPATH,
                                        "//input[@class='r-30o5oe r-1dz5y72 r-13qz1uu r-1niwhzg r-17gur6a r-1yadl64 r-deolkf r-homxoj r-poiln3 r-7cikom r-1ny4l3l r-t60dpp r-fdjqy7']")
   username_field.send_keys(x_username)
   username_field.send_keys(Keys.ENTER)

   password_field = WebDriverWait(driver, 10).until(
       EC.presence_of_element_located((By.XPATH,
                                       "//input[@class='r-30o5oe r-1dz5y72 r-13qz1uu r-1niwhzg r-17gur6a r-1yadl64 r-deolkf r-homxoj r-poiln3 r-7cikom r-1ny4l3l r-t60dpp r-fdjqy7']"))
   )
   password_field.send_keys(x_password)
   password_field.send_keys(Keys.ENTER)

   print("Sign In Successful...\n")

   sleep(10)

results = []


# Scrape
def scrape():
   display_names = driver.find_elements(By.XPATH,
                                        '//*[@class="css-175oi2r r-1wbh5a2 r-dnmrzs r-1ny4l3l r-1awozwy r-18u37iz"]/div[1]/div/a/div/div[1]/span/span')
   usernames = driver.find_elements(By.XPATH,
                                    '//*[@class="css-175oi2r r-1wbh5a2 r-dnmrzs r-1ny4l3l r-1awozwy r-18u37iz"]/div[2]/div/div[1]/a/div/span')
   posts = driver.find_elements(By.XPATH,
                                '//*[@class="css-146c3p1 r-8akbws r-krxsd3 r-dnmrzs r-1udh08x r-bcqeeo r-1ttztb7 r-qvutc0 r-37j5jr r-a023e6 r-rjixqe r-16dba41 r-bnwqim"]/span')
   comments = driver.find_elements(By.XPATH,
                                   '//*[@class="css-175oi2r r-1kbdv8c r-18u37iz r-1wtj0ep r-1ye8kvj r-1s2bzr4"]/div[1]/button/div/div[2]/span/span/span')
   retweets = driver.find_elements(By.XPATH,
                                   '//*[@class="css-175oi2r r-1kbdv8c r-18u37iz r-1wtj0ep r-1ye8kvj r-1s2bzr4"]/div[2]/button/div/div[2]/span/span/span')
   likes = driver.find_elements(By.XPATH,
                                '//*[@class="css-175oi2r r-1kbdv8c r-18u37iz r-1wtj0ep r-1ye8kvj r-1s2bzr4"]/div[3]/button/div/div[2]/span/span/span')
   impressions = driver.find_elements(By.XPATH,
                                      '//*[@class="css-175oi2r r-1kbdv8c r-18u37iz r-1wtj0ep r-1ye8kvj r-1s2bzr4"]/div[4]/a/div/div[2]/span/span/span')

   min_length = min(len(display_names), len(usernames), len(posts), len(comments), len(retweets), len(likes),
                    len(impressions))

   for each in range(min_length):
       results.append({
           'Username': usernames[each].text,
           'displayName': display_names[each].text,
           # removesuffix (Python 3.9+) drops the exact label; rstrip would strip a set of characters
           'Post': posts[each].text.removesuffix("Show more"),
           'Comments': 0 if comments[each].text == "" else convert_to_numeric(comments[each].text),
           'Retweets': 0 if retweets[each].text == "" else convert_to_numeric(retweets[each].text),
           'Likes': 0 if likes[each].text == "" else convert_to_numeric(likes[each].text),
           'Impressions': 0 if impressions[each].text == "" else convert_to_numeric(impressions[each].text)
       })


def reorder_json_by_impressions(json_data):
   # Sort the list in place by 'Impressions', highest first;
   # a count that could not be parsed (None) is treated as 0
   json_data.sort(key=lambda x: int(x['Impressions'] or 0), reverse=True)


def organize_write_data(data: list):
   # Serialize the results; encoding to ASCII with "ignore" drops non-ASCII characters such as emoji
   output = json.dumps(data, indent=2, ensure_ascii=False).encode("ascii", "ignore").decode("utf-8")
   try:
       with open("result.json", 'w', encoding='utf-8') as file:
           file.write(output)
   except Exception as err:
       print(f"Error encountered: {err}")


def convert_to_numeric(value):
   multipliers = {'K': 10 ** 3, 'M': 10 ** 6, 'B': 10 ** 9}

   try:
       if value[-1] in multipliers:
           return int(float(value[:-1]) * multipliers[value[-1]])
       else:
           return int(value)
   except ValueError:
       # Handle the case where the conversion fails
       return None


actions = ActionChains(driver)
for i in range(5):
   actions.send_keys(Keys.END).perform()
   sleep(5)
   scrape()

reorder_json_by_impressions(results)
organize_write_data(results)

print(f"Scraping Information on {search_keyword} is done.")

driver.quit()

Final results

Here’s what the JSON file should look like after the scraping is done:

[
  {
    "Username": "@LindaEvelyn_N",
    "displayName": "Linda Evelyn Namulindwa",
    "Post": "Still getting used to Ugandan local foods so I had Glovo deliver me a KFC Streetwise Spicy rice meal (2 pcs of chicken & jollof rice at Ugx 18,000)\n\nNot only was it fast but it also accepts all payment methods.\n\n#GlovoDeliversKFC\n#ItsFingerLinkingGood",
    "Comments": 105,
    "Retweets": 148,
    "Likes": 1500,
    "Impressions": 66000
  },
  {
    "Username": "@GymCheff",
    "displayName": "The Gym Chef",
    "Post": "Delicious High Protein KFC Zinger Rice Box!",
    "Comments": 1,
    "Retweets": 68,
    "Likes": 363,
    "Impressions": 49000
  }
]

This guide can be applied to scrape data on various topics of interest, supporting studies in public sentiment analysis, trend tracking, monitoring, and reputation management. Python simplifies automated data collection with its standard library and third-party packages, which handle proxy configuration, page scrolling, and organizing the collected information.
