Web scraping with Python: using CSV as a database
December 22, 2024

I got an interesting request the other day. Someone I know uses CSV files to move data from one place to another; the data are book registrations for reading programs. At one point she said to me, "Well, now the rest of the job is for the robot. I have to get the ISBN of each book." As she said, this is the robot's job, so why not let the robot do it?

ISBN stands for International Standard Book Number.

A work can have multiple ISBNs, because each edition has its own ISBN. In this case any ISBN will do, as long as the media matches. The media registered in the CSV is one of the following values (these are the exact strings the code below matches on):
-> ebook
-> fisical (sic, physical)
-> audio
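
For reference, the CSV the script expects has one row per book, something like this (the column names come from the code below, and the row is the sample shown at the end of the post):

Name,language,media
This other Eden,🇬🇧 English,audio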

Let's look at the logic:
-> Load and open the CSV file.
-> Extract the columns with headers.
-> Extract the media column.
-> For each title, search Google for its ISBN.
-> Extract the title from the results page.
-> Extract the list of ISBNs.
-> Extract the list of media types.
-> Check the registered media and pick the matching ISBN; if our criterion is not found, return the first item in the list.
-> Record which media the ISBN came from, for later verification.

The following are the necessary libraries:

import requests # to make the requests
from bs4 import BeautifulSoup # to parse the HTML we receive
import pandas as pd # to handle the CSV files
import time
import random # these last two let us generate random access intervals

There are over 600 books on this list, and since I don't want to be blocked by Google, we'll space the accesses out with random, human-friendly intervals. We'll also send a header telling the site which browser we're using. To find yours, open the "Network" tab in your browser's developer tools and look for "User-Agent".
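
In code, the header is just a dict. The User-Agent string below is only an illustration (an assumed Chrome-on-Windows value); paste the one you copied from your own browser. The 60-90 second pause is the same one used in the loop further down:

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36"
}
time.sleep(random.uniform(60, 90)) # wait 60-90 seconds between requests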

To search on Google, we use the following URL pattern:

url_base = "https://www.google.com/search?q=isbn" # everything after the '=' is the search query
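To make the pattern concrete, here is a quick sketch of the final URL for one title (the '+' signs standing in for spaces are explained next):

url = url_base + "+" + "This other Eden".replace(" ", "+")
# -> https://www.google.com/search?q=isbn+This+other+Eden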

Remember, URLs can't contain spaces, so we'll replace the spaces in each title with "+". In pandas, a "spreadsheet" is called a DataFrame, and it's common to abbreviate it as df. Finally, maybe you're using Windows like me, in which case the paths you copy use backslashes, while forward slashes are safer in Python strings. Let's write a function that takes the path we pasted and converts it to the other format.

path = r"C:\path\livros.csv"

def invert_url_pattern(url):
    return url.replace("\\","/")

path = invert_url_pattern(path)

def search_book(path):
    url_base = "https://www.google.com/search?q=isbn"
    headers = {
        "User-Agent": "your User-Agent string here"  # the value you copied from your browser
    }
    
    df = pd.read_csv(path, encoding='utf-8')
    books = df["Name"].tolist()
    media = df["media"].tolist()
    # we'll store the search results here and then insert them all into the DataFrame
    title_books = []
    isbn_books = []
    media_books = []  

    for index, book in enumerate(books):
        time.sleep(random.uniform(60, 90))
        
        url = url_base + "+" + book.replace(" ", "+")
        req = requests.get(url, headers=headers)

        site = BeautifulSoup(req.text, "html.parser")
        # we use the CSS classes to locate the content
        title = site.find("span", class_="Wkr6U")
        isbns = site.find_all("div", class_="bVj5Zb")
        medias = site.find_all("div", class_="TCYkdd")
        # if the search failed, store empty strings and move on to the next book
        if title is None:
            title_books.append("")
            isbn_books.append("")
            media_books.append("")
            continue

        # Dict assignment keeps the last value written, and Google lists the
        # newest edition of each category first. By reversing the lists, the
        # newest of each category is processed last, so it wins out.
        # (Both lists are reversed together to keep the media/ISBN pairs aligned.)
        unified_data = {}
        for m, i in zip(medias[::-1], isbns[::-1]):
            unified_data[m.text] = i.text

        # the keys are Google's Portuguese labels: "Livro digital" = e-book,
        # "Livro capa dura" = hardcover, "Audiolivro" = audiobook
        first_isbn = next(iter(unified_data.values()), "")
        match media[index]:
            case "ebook":
                isbn_books.append(unified_data.get("Livro digital", first_isbn))
                media_books.append("Livro digital")
            case "fisical":
                isbn_books.append(unified_data.get("Livro capa dura", first_isbn))
                media_books.append("Livro capa dura")
            case "audio":
                isbn_books.append(unified_data.get("Audiolivro", first_isbn))
                media_books.append("Audiolivro")
            case _:
                # no criterion matched: fall back to the first ISBN in the list
                isbn_books.append(first_isbn)
                media_books.append("")

        title_books.append(title.text)

    df["Titulo do Livro"] = title_books
    df["ISBN"] = isbn_books
    df["Tipo de Livro"] = media_books

    return df
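One practical caveat: Google sometimes answers with an HTTP 429 or a CAPTCHA page when it suspects automation, and then the classes above won't be found. Here is a minimal sketch of a guard you could wrap around the request; the retry count and backoff are my own choices, not part of the original script:

def fetch_with_retry(url, headers, attempts=3):
    # try a few times, backing off longer after each failed attempt
    for attempt in range(attempts):
        resp = requests.get(url, headers=headers)
        if resp.status_code == 200:
            return resp
        time.sleep(random.uniform(60, 90) * (attempt + 1))
    return resp  # still failing: the caller should check status_code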

Okay, everything is ready; we just need to test it! I'll leave the example line I received so you can try it yourself.

Name              language     media
This other Eden   🇬🇧 English   audio
df = search_book(path)

df.to_csv(invert_url_pattern(r"C:\your\path\to\save\file_name.csv"), encoding='utf-8', index=False)
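Since the last step of our logic asks us to flag results for later verification, here is a small sketch of how you could list the rows that came back empty, using the column names we just created:

# rows where the scrape failed and no ISBN was found
missing = df[df["ISBN"] == ""]
print(missing[["Name", "Tipo de Livro"]])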

I hope this is useful to you and that you manage to bring some automation into your daily life!

