I got an interesting request recently. Someone uses CSV files to move data from one place to another; the data are book registrations for reading programs. At one point she told me, “Well, the rest of the job is for the robot. I still have to get the ISBN of each book.” If, as she said, this is the robot’s job, why not let the robot do it?
ISBN is an acronym for International Standard Book Number.
A work can have multiple ISBNs because each version has its own. In this case, any ISBN can be used as long as the media matches. The following media are registered in the CSV:
-> e-book
-> physical
-> audiobook
Let’s look at the logic:
-> Load and open the CSV file.
-> Extract the columns by their headers.
-> Extract the media column.
-> For each title, search Google for its ISBN.
-> Extract the title from the results page.
-> Extract the list of ISBNs.
-> Extract the list of media.
-> Check the registered media and pick the closest matching ISBN; if our criterion is not found, return the first item in the list.
-> Record which media the ISBN was obtained from, for later verification.
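The CSV-handling steps above can be sketched with pandas (a minimal sketch, assuming the file has the "Name" and "media" columns shown in the example row later on; here a small in-memory CSV stands in for the real file):

```python
import pandas as pd
from io import StringIO

# A tiny CSV standing in for the real 600-book file
csv_text = "Name,language,media\nThis other Eden,English,vocal\n"

df = pd.read_csv(StringIO(csv_text), encoding="utf-8")
books = df["Name"].tolist()   # titles to search for
media = df["media"].tolist()  # registered media for each title
print(books, media)  # ['This other Eden'] ['vocal']
```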
The following are the necessary libraries:
import requests # to make the HTTP requests
from bs4 import BeautifulSoup # to parse the HTML we receive
import pandas as pd # to handle the CSV files
import time
import random # these two generate random access intervals
There are over 600 books on this list, and since I don’t want to be blocked by Google, we’ll space the requests out at random intervals so the access pattern looks more human. We’ll also send a header telling the site which browser we’re using. To find yours, open your browser’s developer tools, go to the “Network” tab, and look for “User-Agent” in any request.
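A minimal sketch of this polite-access setup: a headers dict with a User-Agent string (the value below is a generic example, not necessarily yours; copy the real one from your browser’s Network tab) plus a helper that sleeps for a random interval between requests:

```python
import time
import random

# Example User-Agent; replace it with the one from your own browser
headers = {
    "User-Agent": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                   "AppleWebKit/537.36 (KHTML, like Gecko) "
                   "Chrome/120.0.0.0 Safari/537.36")
}

def polite_pause(low=60, high=90):
    """Sleep for a random interval so the access pattern looks less robotic."""
    delay = random.uniform(low, high)
    time.sleep(delay)
    return delay
```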
To search on Google, we use the following URL pattern:
url_base = "https://www.google.com/search?q=isbn" # everything after '=' is the query
Remember, URLs can’t contain spaces, so we’ll replace the spaces in each title with “+”. In pandas, a “spreadsheet” is called a DataFrame, and df is the usual abbreviation for one. Finally, you may be on Windows like me: Windows paths use backslashes, while forward slashes are the Unix convention our code expects. Let’s write a function that takes the path we pasted and converts it to the other format.
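The space-to-“+” replacement looks like this; as a hedged alternative, `urllib.parse.quote_plus` from the standard library does the same job and also escapes other unsafe characters (this is not what the article’s code uses, just a safer option):

```python
from urllib.parse import quote_plus

url_base = "https://www.google.com/search?q=isbn"
book = "This other Eden"

# Simple replacement, as in the article's code
url_simple = url_base + "+" + book.replace(" ", "+")
# Safer: also escapes characters like '&' or '?'
url_quoted = url_base + "+" + quote_plus(book)

print(url_simple)  # https://www.google.com/search?q=isbn+This+other+Eden
```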
path = r"C:\caminho\livros.csv"

def invert_url_pattern(url):
    return url.replace("\\", "/")

path = invert_url_pattern(path)
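The standard library’s `pathlib` can do the same conversion: `PureWindowsPath.as_posix()` returns the path with forward slashes (a sketch equivalent to invert_url_pattern above, and it works on any operating system):

```python
from pathlib import PureWindowsPath

path = r"C:\caminho\livros.csv"
posix_path = PureWindowsPath(path).as_posix()
print(posix_path)  # C:/caminho/livros.csv
```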
def search_book(path):
    url_base = "https://www.google.com/search?q=isbn"
    headers = {
        "User-Agent": "your user agent"
    }
    df = pd.read_csv(path, encoding='utf-8')
    books = df["Name"].tolist()
    media = df["media"].tolist()

    # we collect the results here and insert them all into the DataFrame at the end
    title_books = []
    isbn_books = []
    media_books = []

    for index, book in enumerate(books):
        time.sleep(random.uniform(60, 90))
        url = url_base + "+" + book.replace(" ", "+")
        req = requests.get(url, headers=headers)
        site = BeautifulSoup(req.text, "html.parser")

        # we use the CSS classes to locate the content
        title = site.find("span", class_="Wkr6U")
        isbns = site.find_all("div", class_="bVj5Zb")
        medias = site.find_all("div", class_="TCYkdd")

        # if anything fails, we store empty strings
        if title is None:
            title_books.append("")
            isbn_books.append("")
            media_books.append("")
            continue

        # In the loop below, the last item written wins, since we walk
        # the list top to bottom. By reversing the ISBN list, we make
        # sure the newest ISBN of each category is processed last.
        isbns = isbns[::-1]
        unified_data = {}
        for i in range(len(medias)):
            unified_data[medias[i].text] = isbns[i].text

        match media[index]:
            case "ebook":
                isbn_books.append(unified_data["Livro digital"])
                media_books.append("Livro digital")
            case "fisical":
                isbn_books.append(unified_data["Livro capa dura"])
                media_books.append("Livro capa dura")
            case "audio":
                isbn_books.append(unified_data["Audiolivro"])
                media_books.append("Audiolivro")
            case _:
                # a dict has no position 0; take the first ISBN found
                isbn_books.append(next(iter(unified_data.values()), ""))
                media_books.append("")
        title_books.append(title.text)

    df["Titulo do Livro"] = title_books
    df["ISBN"] = isbn_books
    df["Tipo de Livro"] = media_books
    return df
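One fragile point in the lookup above: if Google’s result panel doesn’t list the exact media label, `unified_data["Livro digital"]` raises a KeyError. A hedged sketch of a safer lookup (a hypothetical helper, not part of the original code) that uses `dict.get` with a fallback to the first ISBN found:

```python
def pick_isbn(unified_data, wanted_label):
    """Return the ISBN for the wanted media label, or the first ISBN found."""
    fallback = next(iter(unified_data.values()), "")
    return unified_data.get(wanted_label, fallback)

# Example with hypothetical scraped data
data = {"Livro digital": "978-1-111", "Audiolivro": "978-2-222"}
print(pick_isbn(data, "Livro digital"))    # 978-1-111
print(pick_isbn(data, "Livro capa dura"))  # missing label: falls back to 978-1-111
```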
Okay, everything is ready; all that’s left is to test it! Note that match/case requires Python 3.10 or newer. I’ll leave the example row I received so you can try it yourself.
Name | language | media
---|---|---
This other Eden | 🇬🇧 English | vocal
df = search_book(path)
df.to_csv(invert_url_pattern(r"C:\seu\caminho\para\salvar\nome_do_arquivo.csv"), encoding='utf-8', index=False)
I hope it’s useful to you and that you can implement some automation into your daily life!