Super detailed tutorial: Crawling GitHub repository folders without using the API
This ultra-detailed tutorial, written by Shpetim Haxhiu, walks you through programmatically crawling GitHub repository folders without relying on the GitHub API. It covers everything from understanding the repository page's HTML structure to building a robust recursive implementation with enhanced functionality.
1. Setup and installation
Before you begin, make sure you have:
- Python: Version 3.7 or higher installed.
- Libraries: Install `requests` and `beautifulsoup4`:
  pip install requests beautifulsoup4
- Editor: Any IDE that supports Python, such as VS Code or PyCharm.
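Once these are installed, you can run a quick check to confirm the environment is ready. This is just a convenience snippet and assumes a standard Python setup:

import sys
import requests
import bs4

print("Python:", sys.version.split()[0])
print("requests:", requests.__version__)
print("beautifulsoup4:", bs4.__version__)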
2. Analyze GitHub HTML structure
To crawl a GitHub folder, you need to understand the HTML structure of the repository page. On the GitHub repository page:
- Folders: links whose paths contain `/tree/<branch>/<folder>`.
- Files: links whose paths contain `/blob/<branch>/<file>`.

Each item (folder or file) sits inside a `div` with `role="rowheader"` and contains an `<a>` tag.
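For example, the pattern can be illustrated with a short, self-contained Python check. The sample markup below is invented for illustration only; GitHub's real HTML is much larger and can change over time, so treat this as a sketch of the structure rather than the exact page source:

from bs4 import BeautifulSoup

# Hypothetical, simplified markup mimicking GitHub's file-listing rows.
sample_html = """
<div role="rowheader"><a href="/octocat/Hello-World/tree/main/src">src</a></div>
<div role="rowheader"><a href="/octocat/Hello-World/blob/main/README.md">README.md</a></div>
"""

soup = BeautifulSoup(sample_html, "html.parser")
for link in soup.select('div[role="rowheader"] a'):
    kind = "Folder" if "/tree/" in link["href"] else "File"
    print(kind, link.text.strip(), link["href"])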
3. Implementing the crawler
3.1. Recursive crawling function
This script recursively crawls a folder and prints its structure. To limit the recursion depth and avoid unnecessary load, we use a `max_depth` parameter.
import requests
from bs4 import BeautifulSoup

def crawl_github_folder(url, depth=0, max_depth=3):
    """
    Recursively crawls a GitHub repository folder structure.

    Parameters:
    - url (str): URL of the GitHub folder to scrape.
    - depth (int): Current recursion depth.
    - max_depth (int): Maximum depth to recurse.
    """
    if depth > max_depth:
        return

    headers = {"User-Agent": "Mozilla/5.0"}
    response = requests.get(url, headers=headers)
    if response.status_code != 200:
        print(f"Failed to access {url} (Status code: {response.status_code})")
        return

    soup = BeautifulSoup(response.text, 'html.parser')

    # Extract folder and file links
    items = soup.select('div[role="rowheader"] a')
    for item in items:
        item_name = item.text.strip()
        item_url = f"https://github.com{item['href']}"
        if '/tree/' in item_url:
            print(f"{' ' * depth}Folder: {item_name}")
            crawl_github_folder(item_url, depth + 1, max_depth)
        elif '/blob/' in item_url:
            print(f"{' ' * depth}File: {item_name}")

# Example usage
if __name__ == "__main__":
    repo_url = "https://github.com/<owner>/<repo>/tree/<branch>/<folder>"  # replace the placeholders with a real folder URL
    crawl_github_folder(repo_url)
4. Feature description
- Request headers: use a `User-Agent` string to imitate a browser and avoid being blocked.
- Recursive crawling:
  - Folder links (`/tree/`) are detected and entered recursively.
  - File links (`/blob/`) are listed without further requests.
- Indentation: reflects the folder hierarchy in the output (see the example below).
- Depth limit: prevents excessive recursion via a maximum depth (`max_depth`).
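For a hypothetical repository, the printed output might look roughly like this (folder and file names are invented; each level of nesting adds one space of indentation):

Folder: src
 File: main.py
 Folder: utils
  File: helpers.py
File: README.md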
5. Enhanced functionality
These enhancements are designed to improve the functionality and reliability of the crawler. They address common challenges such as exporting results, handling errors, and avoiding rate limits, ensuring the tool is efficient and user-friendly.
5.1. Export results
Save the output to a structured JSON file for easy consumption.
import json

import requests
from bs4 import BeautifulSoup

def crawl_to_json(url, depth=0, max_depth=3):
    """Crawls and saves results as JSON."""
    result = {}
    if depth > max_depth:
        return result

    headers = {"User-Agent": "Mozilla/5.0"}
    response = requests.get(url, headers=headers)
    if response.status_code != 200:
        print(f"Failed to access {url}")
        return result

    soup = BeautifulSoup(response.text, 'html.parser')
    items = soup.select('div[role="rowheader"] a')
    for item in items:
        item_name = item.text.strip()
        item_url = f"https://github.com{item['href']}"
        if '/tree/' in item_url:
            result[item_name] = crawl_to_json(item_url, depth + 1, max_depth)
        elif '/blob/' in item_url:
            result[item_name] = "file"
    return result

if __name__ == "__main__":
    repo_url = "https://github.com/<owner>/<repo>/tree/<branch>/<folder>"  # replace the placeholders with a real folder URL
    structure = crawl_to_json(repo_url)
    with open("output.json", "w") as file:
        json.dump(structure, file, indent=2)
    print("Repository structure saved to output.json")
5.2. Error handling
Add robust error handling for network failures, HTTP errors, and unexpected HTML changes:
try:
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()  # raise an exception for 4xx/5xx responses
except requests.exceptions.RequestException as e:
    print(f"Error fetching {url}: {e}")
    return
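The try/except above covers network failures and HTTP errors. For the "unexpected HTML changes" part, one simple defensive measure (an assumption about how you might want to surface the problem, not something from the original script) is to warn when the selector matches nothing:

from bs4 import BeautifulSoup

def extract_row_links(html, url):
    """Return the folder/file links, warning when the expected markup is missing."""
    soup = BeautifulSoup(html, "html.parser")
    items = soup.select('div[role="rowheader"] a')
    if not items:
        # An empty match usually means GitHub changed its markup or the page is not a file listing.
        print(f"Warning: no folder/file links found at {url}; the HTML structure may have changed.")
    return items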
5.3. Rate Limiting
To avoid being rate limited by GitHub, introduce a delay:
import time

def crawl_with_delay(url, depth=0):
    time.sleep(2)  # Delay between requests
    # Crawling logic here
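Rather than duplicating the crawling logic, one way to apply the delay (a sketch, assuming the crawl_github_folder function from section 3) is to wrap the HTTP request in a helper that sleeps before every call, so recursive calls are throttled automatically:

import time
import requests

REQUEST_DELAY_SECONDS = 2  # assumed polite delay between requests; tune as needed

def fetch_with_delay(url, headers=None):
    """Sleep before each request so every page fetch is spaced out."""
    time.sleep(REQUEST_DELAY_SECONDS)
    return requests.get(url, headers=headers or {"User-Agent": "Mozilla/5.0"}, timeout=10)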
6. Ethical considerations
This section, written by Shpetim Haxhiu, an expert in software automation and ethical programming, outlines best practices to follow when crawling GitHub:
- Compliance: adhere to GitHub's Terms of Service.
- Minimize load: respect GitHub's servers by limiting request volume and adding delays between requests.
- Permission: obtain permission before performing a broad crawl of private repositories.
7. Complete code
Here's a comprehensive script with all the functionality:
import requests
from bs4 import BeautifulSoup
import json
import time

def crawl_github_folder(url, depth=0, max_depth=3):
    result = {}
    if depth > max_depth:
        return result

    headers = {"User-Agent": "Mozilla/5.0"}
    try:
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()
    except requests.exceptions.RequestException as e:
        print(f"Error fetching {url}: {e}")
        return result

    soup = BeautifulSoup(response.text, 'html.parser')
    items = soup.select('div[role="rowheader"] a')
    for item in items:
        item_name = item.text.strip()
        item_url = f"https://github.com{item['href']}"
        if '/tree/' in item_url:
            print(f"{' ' * depth}Folder: {item_name}")
            result[item_name] = crawl_github_folder(item_url, depth + 1, max_depth)
        elif '/blob/' in item_url:
            print(f"{' ' * depth}File: {item_name}")
            result[item_name] = "file"

    time.sleep(2)  # Avoid rate-limiting
    return result

if __name__ == "__main__":
    repo_url = "https://github.com/<owner>/<repo>/tree/<branch>/<folder>"  # replace the placeholders with a real folder URL
    structure = crawl_github_folder(repo_url)
    with open("output.json", "w") as file:
        json.dump(structure, file, indent=2)
    print("Repository structure saved to output.json")
By following this detailed guide, you can build a powerful GitHub folder crawler. The tool can be adapted to various needs while ensuring ethical compliance.
Feel free to leave questions in the comments, and don't hesitate to reach out to me.