In this article, we will look at how to scrape GitHub repositories using Python. We will use the well-known BeautifulSoup and Requests libraries and explain each step of the GitHub scraping script in detail so that you can follow along easily.
What is GitHub scraping and what is it for? There are numerous reasons to use it; the most common ones are described below.
GitHub not only hosts repositories; it also has a massive user base and a strong reputation, which makes it a reliable source of data.
The information stored on GitHub is useful for tracking technology trends and improving software development. It is also valuable for keeping up with competitors in the IT world.
When it comes to scraping a web page, Python is the easiest language to work with thanks to its wide range of libraries and modules. To scrape GitHub with Python, you need to add the following modules: Requests for making HTTP requests and BeautifulSoup for parsing HTML.
This can be done quite easily with GitHub scraping tools, so let us go through the process step by step.
This part will show how to scrape GitHub repositories. The main steps in building the scraper are setting up the environment, installing the libraries, requesting and parsing the repository page, extracting the data, and saving it to a JSON file.
Below, we will go through each of these steps in detail and end with a complete GitHub scraping script.
Make sure you have Python installed on your machine. Next, create a new Python virtual environment to start the GitHub scraping process.
python -m venv github_scraper
source github_scraper/bin/activate # For macOS and Linux
github_scraper\Scripts\activate # For Windows
As mentioned before, BeautifulSoup and Requests are what we will use to scrape GitHub repositories. In the activated virtual environment, run this command to add them to your project dependencies:
pip install beautifulsoup4 requests
Select any GitHub repository you would like to scrape. First, define the link in a variable, then make an HTTP request that retrieves the page's HTML.
import requests

url = "https://github.com/TheKevJames/coveralls-python"
response = requests.get(url)
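Optionally, you can verify that the request succeeded before parsing. This check is not part of the original flow, but it avoids working with an error page:
# Optional: stop early if GitHub did not return the page successfully
if response.status_code != 200:
    raise RuntimeError(f'Request failed with status code {response.status_code}')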
To parse the HTML, feed it into BeautifulSoup.
from bs4 import BeautifulSoup

soup = BeautifulSoup(response.text, 'html.parser')
The BeautifulSoup() constructor needs two arguments: the HTML content (response.text) and the name of the parser to use ('html.parser' here).
BeautifulSoup parses the HTML and builds a tree structure from it. More specifically, the soup variable exposes the methods needed to select elements from the DOM tree, such as find(), find_all(), select(), and select_one().
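As a quick standalone illustration (using the BeautifulSoup class we just imported, on a made-up HTML snippet rather than the GitHub page), here is how those selection methods behave:
# A tiny sample document, only to demonstrate the selection methods
sample = BeautifulSoup('<div class="repo"><h1 itemprop="name">demo</h1><strong>42</strong></div>', 'html.parser')
sample.find('h1')                       # first <h1> tag, or None
sample.find_all('strong')               # list of every <strong> tag
sample.select('div.repo strong')        # CSS selector, returns a list of matches
sample.select_one('[itemprop="name"]')  # CSS selector, returns the first match or None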
Here is another important step: identifying the right HTML elements to extract GitHub data from. It comes before actually writing the Python script for GitHub repositories.
Before scraping GitHub, familiarize yourself with the web page itself. Then open the developer tools by pressing F12. Looking at the markup, you will notice that many elements on the page do not have unique classes or attributes that would make it easy to navigate to the right element. Go through the page and decide which elements you want to extract.
Now we are ready to create the Python script for scraping GitHub repositories. It will extract useful information such as the repository name, description, stars, and latest commit. To do this, we select the required elements and retrieve their text values.
repo_title = soup.select_one('[itemprop="name"]').text.strip()
The itemprop="name" attribute uniquely identifies the element that holds the repository name, so we select it by that attribute. Text fields on GitHub usually contain extra spaces and newline characters, which strip() removes.
main_branch = soup.select_one('[class="Box-sc-g0xbh4-0 ffLUq ref-selector-button-text-container"]').text.strip()
Note that there is no simpler way to select the HTML element that holds the name of the main branch. The workaround is to pick a class value that is unique on the page and extract its text; keep in mind that such generated class names may change over time.
relative_time_html_element = soup.select_one('relative-time')
latest_commit = relative_time_html_element['datetime']
Here we select the relative-time tag, which holds the date of the latest commit, and read the date from its datetime attribute.
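The datetime attribute holds an ISO-8601 timestamp. If you would rather work with a Python datetime object than a string, you could parse it; this optional sketch assumes the usual trailing "Z" UTC marker:
from datetime import datetime

# Assumes a timestamp such as "2024-01-01T00:00:00Z"; replace() handles the trailing "Z"
latest_commit_dt = datetime.fromisoformat(latest_commit.replace('Z', '+00:00'))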
Next, gather the information from the repository sidebar: description, stars, watchers, and forks.
# description
bordergrid_html_element = soup.select_one('.BorderGrid')
about_html_element = bordergrid_html_element.select_one('h2')
description_html_element = about_html_element.find_next_sibling('p')
description = description_html_element.get_text().strip()
# stars
star_icon_html_element = bordergrid_html_element.select_one('.octicon-star')
stars_html_element = star_icon_html_element.find_next_sibling('strong')
stars = stars_html_element.get_text().strip().replace(',', '')
# watchers
eye_icon_html_element = bordergrid_html_element.select_one('.octicon-eye')
watchers_html_element = eye_icon_html_element.find_next_sibling('strong')
watchers = watchers_html_element.get_text().strip().replace(',', '')
# forks
fork_icon_html_element = bordergrid_html_element.select_one('.octicon-repo-forked')
forks_html_element = fork_icon_html_element.find_next_sibling('strong')
forks = forks_html_element.get_text().strip().replace(',', '')
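Since GitHub changes its markup from time to time, any of these selectors can stop matching, in which case select_one() returns None and the next attribute access raises an AttributeError. A small optional helper (not part of the original script) makes such lookups fail softly:
def select_text(parent, selector):
    # Return the stripped text of the first element matching the CSS selector, or None if nothing matches
    element = parent.select_one(selector)
    return element.get_text().strip() if element else None

# Example with a selector already used above
repo_title = select_text(soup, '[itemprop="name"]')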
The readme file is also important: it describes the repository and explains how to use the code. If you open the readme file on GitHub, you will notice that its raw version is available at a link of this form:
https://raw.githubusercontent.com/<user>/<repository>/<branch>/readme.md
Since we already have the branch name in main_branch, we can build the URL programmatically with an f-string and make an HTTP request to fetch the file.
readme_url = f'https://github.com/TheKevJames/coveralls-python/blob/{main_branch}/readme.rst'
readme_page = requests.get(readme_url)
readme = None
if readme_page.status_code != 404:
    readme = readme_page.text
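Note that the blob URL above returns GitHub's rendered HTML page rather than the file contents. If you only need the raw text, you could request the raw.githubusercontent.com form shown earlier instead; this sketch assumes the same repository, branch, and file name:
# Alternative: fetch the raw file content instead of the rendered HTML page
raw_readme_url = f'https://raw.githubusercontent.com/TheKevJames/coveralls-python/{main_branch}/readme.rst'
raw_readme_page = requests.get(raw_readme_url)
raw_readme = raw_readme_page.text if raw_readme_page.status_code != 404 else None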
Remember to check the status code so that you do not save the content of GitHub's 404 page when the repository has no readme file.
All of this information is stored in a single dictionary so that it can easily be written to a JSON file.
repo = {}
repo['name'] = repo_title
repo['latest_commit'] = latest_commit
repo['main_branch'] = main_branch
repo['description'] = description
repo['stars'] = stars
repo['watchers'] = watchers
repo['forks'] = forks
repo['readme'] = readme
We will use Python's built-in json module to store the data in JSON format, which is well suited to nested structures like the dictionary we have just built.
import json

with open('github_data.json', 'w', encoding='utf-8') as json_file:
    json.dump(repo, json_file, ensure_ascii=False, indent=4)
Here is the complete GitHub scraping script:
import json
import requests
from bs4 import BeautifulSoup
url = "https://github.com/TheKevJames/coveralls-python"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
repo_title = soup.select_one('[itemprop="name"]').text.strip()
# branch
main_branch = soup.select_one(
    '[class="Box-sc-g0xbh4-0 ffLUq ref-selector-button-text-container"]').text.strip()
# latest commit
relative_time_html_element = soup.select_one('relative-time')
latest_commit = relative_time_html_element['datetime']
# description
bordergrid_html_element = soup.select_one('.BorderGrid')
about_html_element = bordergrid_html_element.select_one('h2')
description_html_element = about_html_element.find_next_sibling('p')
description = description_html_element.get_text().strip()
# stars
star_icon_html_element = bordergrid_html_element.select_one('.octicon-star')
stars_html_element = star_icon_html_element.find_next_sibling('strong')
stars = stars_html_element.get_text().strip().replace(',', '')
# watchers
eye_icon_html_element = bordergrid_html_element.select_one('.octicon-eye')
watchers_html_element = eye_icon_html_element.find_next_sibling('strong')
watchers = watchers_html_element.get_text().strip().replace(',', '')
# forks
fork_icon_html_element = bordergrid_html_element.select_one('.octicon-repo-forked')
forks_html_element = fork_icon_html_element.find_next_sibling('strong')
forks = forks_html_element.get_text().strip().replace(',', '')
# readme
readme_url = f'https://github.com/TheKevJames/coveralls-python/blob/{main_branch}/readme.rst'
readme_page = requests.get(readme_url)
readme = None
if readme_page.status_code != 404:
    readme = readme_page.text
repo = {}
repo['name'] = repo_title
repo['latest_commit'] = latest_commit
repo['main_branch'] = main_branch
repo['description'] = description
repo['stars'] = stars
repo['watchers'] = watchers
repo['forks'] = forks
repo['readme'] = readme
with open('github_data.json', 'w', encoding='utf-8') as json_file:
    json.dump(repo, json_file, ensure_ascii=False, indent=4)
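As a quick sanity check, you can load the file back and print a couple of the stored fields:
# Verify that the JSON file was written and is readable
with open('github_data.json', encoding='utf-8') as json_file:
    saved = json.load(json_file)
print(saved['name'], saved['stars'])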
We have walked through building a GitHub repository scraping script with BeautifulSoup and Requests. You now know how to fetch web pages, extract the relevant data, and save it in a convenient format. These skills are useful for analyzing well-known projects, tracking code changes, or generating reports.
Nonetheless, be mindful of responsible use. In most cases, GitHub's official API is the simpler and more practical option. If you do decide to scrape, make sure you comply with the site's guidelines and do not flood the servers with too many requests.
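For comparison, the official REST API returns most of the same repository metadata without any HTML parsing. A minimal sketch using the public https://api.github.com/repos/<owner>/<repo> endpoint (unauthenticated requests work but are rate-limited):
import requests

# Unauthenticated request to GitHub's public repository endpoint
api_response = requests.get('https://api.github.com/repos/TheKevJames/coveralls-python')
data = api_response.json()
print(data['description'], data['stargazers_count'], data['forks_count'], data['default_branch'])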