How to Scrape Public GitHub Repositories Using Python?

In this article, we will look at how to scrape GitHub repositories using Python. We will review the well-known Requests and BeautifulSoup libraries and walk through building a GitHub scraping script step by step so that you can follow along easily.

Why Scrape GitHub Repositories?

What is GitHub scraping, and what is it for? There are many reasons to use it, but the most common include:

  • To monitor technology trends. Tracking stars and repositories is a great way to follow trends in programming languages, frameworks, and libraries. This information can be very important for decisions about adopting technologies, developing skills, or allocating resources.
  • To use GitHub as a programming knowledge base. A wealth of open-source projects, examples, and solutions can be found on GitHub, so the platform offers a massive amount of programming knowledge and techniques that are useful for learning, improving your skills, and understanding how technologies are implemented.

GitHub does not just host repositories: its massive user base and strong reputation also make it a reliable data source.

The information stored on GitHub is useful for tracking how technologies evolve and for improving software development, and it is critical for keeping up with competitors in the IT world.

Essential Libraries and Tools for Scraping GitHub

When it comes to scraping web pages, Python is the easiest language to work with because of its wide range of libraries and modules. To scrape GitHub with Python, add the following modules:

  • requests: the most widely used HTTP client library; it sends requests and handles responses.
  • BeautifulSoup: an HTML parsing library with advanced features for navigating the document tree and extracting data.
  • Selenium: launches a real browser and lets you click and type on page elements, which helps when content is rendered by JavaScript (see the short sketch after this list).
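
Requests and BeautifulSoup are enough for the static GitHub pages scraped in this tutorial. For pages that render content with JavaScript, Selenium is an option; the block below is only a minimal sketch, assuming selenium has been installed with pip install selenium and a compatible Chrome browser is available.

from selenium import webdriver
from selenium.webdriver.common.by import By

options = webdriver.ChromeOptions()
options.add_argument("--headless=new")  # run the browser without a visible window

driver = webdriver.Chrome(options=options)
try:
    driver.get("https://github.com/TheKevJames/coveralls-python")
    # The same repository-name selector that is used later with BeautifulSoup.
    repo_title = driver.find_element(By.CSS_SELECTOR, '[itemprop="name"]').text
    print(repo_title)
finally:
    driver.quit()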

With these GitHub scraping tools, the job is quite manageable. Let's go through how to do it step by step.

Building a GitHub Repository Scraping Script with Beautiful Soup

This part shows how to scrape GitHub repositories. The main steps in building the scraper are:

  1. Setting up the environment, which means installing Python and the relevant libraries.
  2. Downloading the HTML code of the GitHub page.
  3. Examining the layout of the page to identify the elements you need.
  4. Extracting the data, such as the repository name, description, and number of stars.
  5. Saving the data to the file system.

Below, we'll walk through each of these steps in detail and finish with a complete GitHub scraping script.

Step 1: Setting Up Your Python Project Environment

Make sure Python is installed on your machine. Then create a new Python virtual environment for the GitHub scraping project.


python -m venv github_scraper
source github_scraper/bin/activate  # For macOS and Linux
github_scraper\Scripts\activate     # For Windows

Step 2: Installing Required Python Libraries

As mentioned earlier, BeautifulSoup and Requests will be used for scraping GitHub repositories. With the virtual environment activated, run this command to add them to your project dependencies:


pip install beautifulsoup4 requests

Step 3: Accessing and Downloading the Target GitHub Page

Select a repository you would like to scrape. Then define its URL in a variable and make an HTTP request to download the page code.


import requests

url = "https://github.com/TheKevJames/coveralls-python"
response = requests.get(url)
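
The snippet above is the minimal version. As an optional hardening step that is not part of the original script, you can send a browser-like User-Agent and fail fast on error responses, since GitHub may block or throttle anonymous clients:

import requests

url = "https://github.com/TheKevJames/coveralls-python"
# Optional hardening: identify the client and raise on 4xx/5xx responses.
headers = {"User-Agent": "Mozilla/5.0 (compatible; github-scraper-tutorial)"}
response = requests.get(url, headers=headers, timeout=30)
response.raise_for_status()  # raises requests.exceptions.HTTPError on error status codes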

Step 4: Understanding and Parsing the HTML Structure

To analyze the HTML, feed it into BeautifulSoup:


from bs4 import BeautifulSoup

soup = BeautifulSoup(response.text, 'html.parser')

The BeautifulSoup() constructor needs to be provided with two things.

  1. A string containing the HTML content, stored in the response.text variable.
  2. The parser that Beautiful Soup will use: “html.parser” is the name of the built-in Python HTML parser.

The HTML will be parsed by BeautifulSoup and turned into a tree structure. More specifically, the soup variable exposes the methods required to select relevant elements from the DOM tree, such as the following (a short example comparing two of them appears after the list):

  • find(): returns the first HTML element that corresponds to the provided selector strategy.
  • find_all(): returns a list of HTML elements that match the input selector strategy.
  • select_one(): returns the first HTML element that matches the input CSS selector.
  • select(): returns a list of HTML elements corresponding to the provided CSS selector.
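
For instance, the two calls below locate the same element, once via an attribute filter and once via a CSS selector. This reuses the soup object created above and targets the repository-name element that the script extracts later on.

# find() with an attribute filter and select_one() with a CSS selector:
# both return the first element whose itemprop attribute is "name".
name_via_find = soup.find(attrs={"itemprop": "name"})
name_via_select = soup.select_one('[itemprop="name"]')

print(name_via_find.get_text(strip=True))
print(name_via_select.get_text(strip=True))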

Step 5: Analyzing the Target Page for Relevant Data

Here's another important step: identifying the HTML elements that hold the data you want to scrape from GitHub. This step comes before actually writing the Python script.

Before scraping GitHub, familiarize yourself with the page itself. Then open the developer tools by pressing F12. Looking through the code, you will notice that many elements on the page lack unique classes or attributes, which makes it harder to target the right element. Go through the page and work out which selectors you will need before extracting data; the small helper sketched below can soften the impact when a selector stops matching.
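
Since select_one() returns None when nothing matches, a helper that fails softly is a useful optional addition (it is not part of the original script, and the helper name is our own):

# Optional helper: return an element's text, or a default when the selector no longer matches.
def safe_select_text(soup, css_selector, default=None):
    element = soup.select_one(css_selector)
    return element.get_text(strip=True) if element else default

# Hypothetical usage with the selector from the next step:
# repo_title = safe_select_text(soup, '[itemprop="name"]', default='unknown')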

Step 6: Extracting Repository Details

Now we are ready to create a Python script for scraping GitHub repositories. It will extract useful information such as the number of stars, the description, and the last commit. To do this, we locate the required elements and retrieve their text values.

  • Repository name:
    
    repo_title = soup.select_one('[itemprop="name"]').text.strip()
    

    The itemprop="name" attribute has a unique value on the page, so we select the element by it. Text on GitHub pages usually contains extra spaces and newline characters, which strip() removes.

  • Current branch:
    
    main_branch = soup.select_one('[class="Box-sc-g0xbh4-0 ffLUq ref-selector-button-text-container"]').text.strip()
    

    Note that there is no simpler way to select the HTML element that holds the name of the main branch, so we pick a class that is unique on the page and take its text. Keep in mind that auto-generated class names like this change often, so the selector may need updating.

  • Last commit:
    
    relative_time_html_element = soup.select_one('relative-time')
    latest_commit = relative_time_html_element['datetime']
    

    Here we select the relative-time tag, which stores the date of the user's last commit, and read the date from its datetime attribute.

Next, gather the information shown in the About sidebar: description, stars, watchers, and forks.

  • Description:
    
    bordergrid_html_element = soup.select_one('.BorderGrid')
    about_html_element = bordergrid_html_element.select_one('h2')
    description_html_element = about_html_element.find_next_sibling('p')
    description = description_html_element.get_text().strip()
    
  • Stars:
    
    star_icon_html_element = bordergrid_html_element.select_one('.octicon-star')
    stars_html_element = star_icon_html_element.find_next_sibling('strong')
    stars = stars_html_element.get_text().strip().replace(',', '')
    
  • Watchers:
    
    eye_icon_html_element = bordergrid_html_element.select_one('.octicon-eye')
    watchers_html_element = eye_icon_html_element.find_next_sibling('strong')
    watchers = watchers_html_element.get_text().strip().replace(',', '')
    
  • Forks:
    
    fork_icon_html_element = bordergrid_html_element.select_one('.octicon-repo-forked')
    forks_html_element = fork_icon_html_element.find_next_sibling('strong')
    forks = forks_html_element.get_text().strip().replace(',', '')
    

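Depending on the element you target, GitHub may render these counters in abbreviated form (for example "1.2k") rather than as exact numbers. If that is the case for your page, a small normalization helper like this hedged sketch can convert the text to integers; the abbreviation rules here are an assumption, so check them against the values you actually scrape.

# Hedged helper: convert counter text such as "1,234" or "1.2k" into an integer.
def parse_count(raw_text):
    text = raw_text.strip().lower().replace(',', '')
    multipliers = {'k': 1_000, 'm': 1_000_000}
    if text and text[-1] in multipliers:
        return int(round(float(text[:-1]) * multipliers[text[-1]]))
    return int(text)

# Example: parse_count('1.2k') -> 1200, parse_count('1,234') -> 1234
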
Step 7: Collecting and Analyzing Readme Files

The readme file is crucial: it describes the repository and explains how to use the code. If you look at a readme.md file, you will notice that its raw content is served from a link of this form:


https://raw.githubusercontent.com/<user>/<repository>/<branch>/readme.md

Since we know the repository owner, the repository name, and the main branch, we can build this URL programmatically with an f-string and make an HTTP request to fetch the file contents.


readme_url = f'https://raw.githubusercontent.com/TheKevJames/coveralls-python/{main_branch}/readme.rst'  # adjust the file name and case if this returns 404
readme_page = requests.get(readme_url)

readme = None
if readme_page.status_code != 404:
    readme = readme_page.text

Remember to check for a 404 status so that you do not save the contents of an error page when the repository has no readme file at that path.
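
Readme file names and extensions vary between repositories (README.md, README.rst, readme.md, and so on), so hard-coding a single name is fragile. A hedged variant that probes a few common names against the raw URL pattern shown above could look like this; fetch_readme is a hypothetical helper, not part of the original script.

import requests

# Hypothetical helper: probe a few common readme file names and return the first hit.
def fetch_readme(owner, repo, branch):
    candidates = ['README.md', 'README.rst', 'readme.md', 'readme.rst']
    for filename in candidates:
        raw_url = f'https://raw.githubusercontent.com/{owner}/{repo}/{branch}/{filename}'
        page = requests.get(raw_url)
        if page.status_code == 200:
            return page.text
    return None

readme = fetch_readme('TheKevJames', 'coveralls-python', main_branch)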

Step 8: Organizing and Storing Scraped Data Efficiently

All the information will be stored in a single dictionary so that we can easily write it to a JSON file.


repo = {}
repo['name'] = repo_title
repo['latest_commit'] = latest_commit
repo['main_branch'] = main_branch
repo['description'] = description
repo['stars'] = stars
repo['watchers'] = watchers
repo['forks'] = forks
repo['readme'] = readme
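
Optionally (this is an addition, not part of the original script), you can convert the counter strings to integers before storing them, so downstream analysis does not have to cast them. If the values may be abbreviated, reuse the parse_count() helper sketched earlier instead of int().

# Optional: store the counters as integers rather than strings.
repo['stars'] = int(stars)
repo['watchers'] = int(watchers)
repo['forks'] = int(forks)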

Step 9: Exporting Scraped Data in JSON Format

We'll use Python's built-in json library to export the data, since the JSON format is a good fit for nested structures like the dictionary we have just built.


import json

with open('github_data.json', 'w', encoding='utf-8') as json_file:
    json.dump(repo, json_file, ensure_ascii=False, indent=4)
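
As a quick sanity check, you can read the file back with the same module:

import json

# Load the exported file and print a couple of fields to confirm the export worked.
with open('github_data.json', 'r', encoding='utf-8') as json_file:
    saved_repo = json.load(json_file)

print(saved_repo['name'], saved_repo['stars'])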

Step 10: Integrating All Steps Into a Complete Script


import json
import requests
from bs4 import BeautifulSoup

url = "https://github.com/TheKevJames/coveralls-python"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

repo_title = soup.select_one('[itemprop="name"]').text.strip()

# branch
main_branch = soup.select_one(
    '[class="Box-sc-g0xbh4-0 ffLUq ref-selector-button-text-container"]').text.strip()

# latest commit
relative_time_html_element = soup.select_one('relative-time')
latest_commit = relative_time_html_element['datetime']

# description
bordergrid_html_element = soup.select_one('.BorderGrid')
about_html_element = bordergrid_html_element.select_one('h2')
description_html_element = about_html_element.find_next_sibling('p')
description = description_html_element.get_text().strip()

# stars
star_icon_html_element = bordergrid_html_element.select_one('.octicon-star')
stars_html_element = star_icon_html_element.find_next_sibling('strong')
stars = stars_html_element.get_text().strip().replace(',', '')

# watchers
eye_icon_html_element = bordergrid_html_element.select_one('.octicon-eye')
watchers_html_element = eye_icon_html_element.find_next_sibling('strong')
watchers = watchers_html_element.get_text().strip().replace(',', '')

# forks
fork_icon_html_element = bordergrid_html_element.select_one('.octicon-repo-forked')
forks_html_element = fork_icon_html_element.find_next_sibling('strong')
forks = forks_html_element.get_text().strip().replace(',', '')

# readme
readme_url = f'https://raw.githubusercontent.com/TheKevJames/coveralls-python/{main_branch}/readme.rst'  # adjust the file name and case if this returns 404
readme_page = requests.get(readme_url)

readme = None
if readme_page.status_code != 404:
    readme = readme_page.text

repo = {}
repo['name'] = repo_title
repo['latest_commit'] = latest_commit
repo['main_branch'] = main_branch
repo['description'] = description
repo['stars'] = stars
repo['watchers'] = watchers
repo['forks'] = forks
repo['readme'] = readme

with open('github_data.json', 'w', encoding='utf-8') as json_file:
    json.dump(repo, json_file, ensure_ascii=False, indent=4)
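
Save the complete script under any file name you like, for example github_scraper.py (the name is an arbitrary choice), and run it from the activated virtual environment:

python github_scraper.py

The script writes its output to github_data.json in the current directory.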

Scraping GitHub: Conclusion

We have examined the process of building a GitHub repository scraping script with the help of BeautifulSoup and Requests. You now know how to access web pages, pull out relevant data, and save it in a convenient format. These skills are useful for analyzing well-known projects, tracking changes to code, and generating reports.

Nonetheless, use these techniques responsibly. In most cases, GitHub offers an API that is simpler and more practical to work with. If you do decide to scrape, make sure you comply with the site's guidelines and do not overwhelm the servers with too many requests.
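
For comparison, here is a minimal sketch of fetching the same repository metadata from GitHub's public REST API, with no HTML parsing at all. Unauthenticated requests are rate-limited, so this is only suitable for light use.

import requests

# Public REST endpoint for repository metadata.
api_url = 'https://api.github.com/repos/TheKevJames/coveralls-python'
data = requests.get(api_url, headers={'Accept': 'application/vnd.github+json'}).json()

print(data['description'])
print(data['stargazers_count'], data['forks_count'], data['default_branch'])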
