Detailed Tutorial: Crawling GitHub Repository Folders Without API
December 14, 2024


This detailed tutorial, written by Shpetim Haxhiu, walks you through programmatically crawling GitHub repository folders without relying on the GitHub API. It covers everything from understanding the repository page's HTML structure to building a robust recursive crawler with enhanced functionality.




1. Setup and installation

Before you begin, make sure you have:

  1. Python: version 3.7 or higher installed.
  2. Libraries: the requests and beautifulsoup4 packages (a quick import check is sketched after this list):
   pip install requests beautifulsoup4
  3. Editor: any IDE that supports Python, such as VS Code or PyCharm.
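
If you want to confirm that both libraries are importable before moving on, a minimal check such as the following works:

import requests
import bs4

# Print the installed versions to confirm both packages are importable.
print("requests:", requests.__version__)
print("beautifulsoup4:", bs4.__version__)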



2. Analyze GitHub HTML structure

To crawl a GitHub folder, you need to understand the HTML structure of the repository page. On the GitHub repository page:

  • Folder links contain /tree/<branch>/ in their path.
  • File links contain /blob/<branch>/ in their path.

Each item (folder or file) sits inside a <div> with the attribute role="rowheader" and contains an <a> tag pointing to the item. A minimal sketch of parsing such a row is shown below.


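The snippet below is a minimal sketch of how BeautifulSoup picks these links out of rowheader divs. The HTML string is only an illustration of the pattern described above (using the well-known octocat/Hello-World repository as a stand-in); GitHub's real markup is more elaborate and may change over time.

from bs4 import BeautifulSoup

# Illustrative markup only; a real GitHub page contains much more around it.
sample_html = '''
<div role="rowheader">
  <a href="/octocat/Hello-World/tree/master/docs">docs</a>
</div>
<div role="rowheader">
  <a href="/octocat/Hello-World/blob/master/README.md">README.md</a>
</div>
'''

soup = BeautifulSoup(sample_html, "html.parser")
for link in soup.select('div[role="rowheader"] a'):
    kind = "Folder" if "/tree/" in link["href"] else "File"
    print(kind, link.text.strip(), link["href"])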




3. Implement the crawler



3.1. Recursive crawling function

This script recursively crawls a folder and prints its structure. To limit recursion depth and avoid unnecessary requests, it takes a max_depth parameter.

import requests
from bs4 import BeautifulSoup
import time

def crawl_github_folder(url, depth=0, max_depth=3):
    """
    Recursively crawls a GitHub repository folder structure.

    Parameters:
    - url (str): URL of the GitHub folder to scrape.
    - depth (int): Current recursion depth.
    - max_depth (int): Maximum depth to recurse.
    """
    if depth > max_depth:
        return

    headers = {"User-Agent": "Mozilla/5.0"}
    response = requests.get(url, headers=headers)

    if response.status_code != 200:
        print(f"Failed to access {url} (Status code: {response.status_code})")
        return

    soup = BeautifulSoup(response.text, 'html.parser')

    # Extract folder and file links
    items = soup.select('div[role="rowheader"] a')

    for item in items:
        item_name = item.text.strip()
        item_url = f"https://github.com{item['href']}"

        if '/tree/' in item_url:
            print(f"{'  ' * depth}Folder: {item_name}")
            crawl_github_folder(item_url, depth + 1, max_depth)
        elif '/blob/' in item_url:
            print(f"{'  ' * depth}File: {item_name}")

# Example usage
if __name__ == "__main__":
    repo_url = "https://github.com///tree//"
    crawl_github_folder(repo_url)
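For a repository laid out like the illustrative example in section 2, the printed structure would look roughly like this (the folder and file names here are hypothetical):

Folder: docs
  File: index.md
File: README.md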




4. Feature description

  1. Request headers: a User-Agent string imitates a regular browser and reduces the chance of being blocked (an optional shared-session variant is sketched after this list).
  2. Recursive crawl:

    • Folder links (/tree/) are detected and entered recursively.
    • File links (/blob/) are listed and need no further requests.
  3. Indentation: reflects the folder hierarchy in the output.
  4. Depth limit: prevents excessive recursion via a maximum depth (max_depth).
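
As a small optional refinement that is not part of the original script, a requests.Session lets you set the User-Agent header once and reuse the underlying connection across all the recursive calls:

import requests

# Hypothetical variation: a shared session reuses headers and connections.
session = requests.Session()
session.headers.update({"User-Agent": "Mozilla/5.0"})

def fetch(url):
    """Fetch a page through the shared session with a sensible timeout."""
    return session.get(url, timeout=10)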



5. Enhancements

These enhancements are designed to improve the functionality and reliability of the crawler. They address common challenges such as exporting results, handling errors, and avoiding rate limits, ensuring the tool is efficient and user-friendly.



5.1. Export results

Save the output to a structured JSON file for easy consumption.

import json

import requests
from bs4 import BeautifulSoup

def crawl_to_json(url, depth=0, max_depth=3):
    """Crawls and saves results as JSON."""
    result = {}

    if depth > max_depth:
        return result

    headers = {"User-Agent": "Mozilla/5.0"}
    response = requests.get(url, headers=headers)

    if response.status_code != 200:
        print(f"Failed to access {url}")
        return result

    soup = BeautifulSoup(response.text, 'html.parser')
    items = soup.select('div[role="rowheader"] a')

    for item in items:
        item_name = item.text.strip()
        item_url = f"https://github.com{item['href']}"

        if '/tree/' in item_url:
            result[item_name] = crawl_to_json(item_url, depth + 1, max_depth)
        elif '/blob/' in item_url:
            result[item_name] = "file"

    return result

if __name__ == "__main__":
    repo_url = "https://github.com///tree//"
    structure = crawl_to_json(repo_url)

    with open("output.json", "w") as file:
        json.dump(structure, file, indent=2)

    print("Repository structure saved to output.json")
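Reading the file back is just as simple. The sketch below reloads output.json and prints the top-level entries (the names it prints depend entirely on the repository you crawled):

import json

# Reload the structure produced by crawl_to_json and list the top-level entries.
with open("output.json") as file:
    structure = json.load(file)

for name, value in structure.items():
    print(name, "(file)" if value == "file" else "(folder)")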



5.2. Error handling

Add robust error handling for network failures, timeouts, and error status codes:

try:
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()
except requests.exceptions.RequestException as e:
    print(f"Error fetching {url}: {e}")
    return
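One way to keep this tidy (an optional refactor, not part of the original script) is to wrap the fetch-and-parse step in a small helper that returns None on any failure, so the recursive function only has to check for None:

import requests
from bs4 import BeautifulSoup

def fetch_soup(url, headers=None, timeout=10):
    """Fetch a URL and return the parsed HTML, or None if anything goes wrong."""
    try:
        response = requests.get(url, headers=headers, timeout=timeout)
        response.raise_for_status()
    except requests.exceptions.RequestException as e:
        print(f"Error fetching {url}: {e}")
        return None
    return BeautifulSoup(response.text, "html.parser")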



5.3. Rate limiting

To avoid being rate limited by GitHub, introduce a delay:

import time

def crawl_with_delay(url, depth=0):
    time.sleep(2)  # Delay between requests
    # Crawling logic here
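If GitHub does respond with HTTP 429 (Too Many Requests), a simple backoff-and-retry loop helps. This is an optional sketch, not part of the original tutorial, and the retry count and delays are arbitrary:

import time
import requests

def get_with_backoff(url, headers=None, retries=3, base_delay=2):
    """Retry a request with an increasing delay when rate-limited."""
    response = None
    for attempt in range(retries):
        response = requests.get(url, headers=headers, timeout=10)
        if response.status_code != 429:
            break
        wait = base_delay * (attempt + 1)
        print(f"Rate limited, waiting {wait}s before retrying {url}")
        time.sleep(wait)
    return response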




6. Ethical considerations

This section, written by Shpetim Haxhiu, an expert in software automation and ethical programming, covers the best practices to follow when running a GitHub crawler.

  • Comply: follow GitHub's Terms of Service.
  • Minimize load: respect GitHub's servers by limiting the request rate and adding delays between requests.
  • Permission: obtain permission before performing a broad crawl of private repositories.



7. Complete code

Here's a comprehensive script with all the functionality:

import requests
from bs4 import BeautifulSoup
import json
import time

def crawl_github_folder(url, depth=0, max_depth=3):
    result = {}

    if depth > max_depth:
        return result

    headers = {"User-Agent": "Mozilla/5.0"}
    try:
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()
    except requests.exceptions.RequestException as e:
        print(f"Error fetching {url}: {e}")
        return result

    soup = BeautifulSoup(response.text, 'html.parser')
    items = soup.select('div[role="rowheader"] a')

    for item in items:
        item_name = item.text.strip()
        item_url = f"https://github.com{item['href']}"

        if '/tree/' in item_url:
            print(f"{'  ' * depth}Folder: {item_name}")
            result[item_name] = crawl_github_folder(item_url, depth + 1, max_depth)
        elif '/blob/' in item_url:
            print(f"{'  ' * depth}File: {item_name}")
            result[item_name] = "file"

    time.sleep(2)  # Avoid rate-limiting
    return result

if __name__ == "__main__":
    repo_url = "https://github.com///tree//"
    structure = crawl_github_folder(repo_url)

    with open("output.json", "w") as file:
        json.dump(structure, file, indent=2)

    print("Repository structure saved to output.json")
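If you save the script above as, say, crawler.py (a hypothetical filename), you could also swap the hard-coded URL for a command-line argument. This small variation reuses crawl_github_folder and json from the script above:

import sys

if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("Usage: python crawler.py <github-folder-url>")
        sys.exit(1)

    structure = crawl_github_folder(sys.argv[1])

    with open("output.json", "w") as file:
        json.dump(structure, file, indent=2)

    print("Repository structure saved to output.json")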


By following this detailed guide, you can build a powerful GitHub folder crawler. The tool can be adapted to various needs while ensuring ethical compliance.


Feel free to leave questions in the comment area! Also, don’t forget to contact me:

