Background
While developing a simple divination tool, I ran into an interesting problem. If only a few specific Chinese characters are involved, we can hardcode a dictionary in the script, but what if we want the stroke count of any Chinese character?
pypinyin library
```python
from pypinyin import pinyin, Style

def get_strokes_count(chinese_character):
    # pinyin() returns one sub-list of readings per character,
    # so this counts pinyin results, not strokes
    pinyin_list = pinyin(chinese_character, style=Style.NORMAL)
    strokes_count = len(pinyin_list[0])
    return strokes_count

character = input("Please enter a Chinese character: ")
strokes = get_strokes_count(character)
print("Character '{}' stroke count: {}".format(character, strokes))
```
I gave it a try and found that the result is actually the number of pinyin readings returned for the character in the normal style, not its stroke count, so this approach does not work.
Unified Han Database
The Unihan database is a Chinese character database maintained by the Unicode Consortium. It is quite authoritative, and it also provides online tools.

You can query it online via the Unihan database lookup. The query results contain a kTotalStrokes field, which is exactly the stroke count data we need.

As Unicode's official database, the current version fully covers the basic needs of Chinese character lookup.
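For reference, a kTotalStrokes entry in the database is a tab-separated line of the following form (the character and value here are illustrative):

```
U+4E2D	kTotalStrokes	4
```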
Obtain stroke information from Unihan database
I originally planned to send queries directly through the online lookup tool, but it was too slow: the server is overseas and I was accessing it from China. Since the database file itself is not very large, I simply downloaded it.

After extracting the archive, there are several files.

From the lookup results we know the kTotalStrokes field comes from the IRG sources, so we extract the Unihan_IRGSources.txt file.

I tested a regex on Regular Expressions 101 to pull out the Unicode code point and the stroke count, which are then stored separately for lookup.
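As a quick sanity check of the pattern, here is a minimal sketch against a single sample line (the line is illustrative; real data comes from Unihan_IRGSources.txt, and the pattern is a slightly tightened variant of the one used below):

```python
import re

# Illustrative line in the tab-separated Unihan_IRGSources.txt format
sample = "U+4E2D\tkTotalStrokes\t4"

# Capture the code point and the stroke count
pattern = r"(U\+\S+)\skTotalStrokes\s(\d+)"
matches = re.findall(pattern, sample)
print(matches)  # [('U+4E2D', '4')]
```

With capture groups, `re.findall` returns a list of tuples, so non-matching lines simply yield an empty list and can be skipped.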
Coding
- Retrieve stroke information
```python
import json
import re
from pathlib import Path

file = Path("Stroke/Unihan_IRGSources.txt")
output = Path("Stroke/unicode2stroke.json")

stroke_dict = dict()
with open(file, mode="r", encoding="utf-8") as f:
    for line in f:
        raw_line = line.strip()
        pattern = r"(U\+.*)\skTotalStrokes.*\s(\d+)"
        result = re.findall(pattern=pattern, string=raw_line)
        if len(result) == 0:
            continue
        unicode_key = result[0][0]
        unicode_stroke = result[0][1]
        print(f"{unicode_key}: {unicode_stroke}")
        stroke_dict[unicode_key] = unicode_stroke

with open(file=output, mode="w", encoding="utf-8") as f:
    json.dump(stroke_dict, f, ensure_ascii=False, indent=4)
```
Export it to JSON for easy access later.
- Write the lookup function
```python
with open(output, encoding="utf-8") as f:
    unicode2stroke = json.load(f)

def get_character_stroke_count(char: str) -> int:
    # Convert the character to its "U+XXXX" code point, the key in our dict
    unicode_key = "U+" + hex(ord(char))[2:].upper()
    return int(unicode2stroke[unicode_key])

test_char = "阿"
get_character_stroke_count(char=test_char)
```
Note that for the lookup, the character must be converted to its hexadecimal code point in the U+XXXX form that Unihan uses as the key.
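The conversion step can be checked in isolation: `ord` gives the integer code point, and `hex` plus `.upper()` turns it into the Unihan-style key:

```python
char = "中"
# ord() gives the integer code point (0x4E2D for this character);
# hex() renders it as "0x4e2d", so strip the "0x" prefix and uppercase it
key = "U+" + hex(ord(char))[2:].upper()
print(key)  # U+4E2D
```

One caveat of this formatting: it produces no zero padding, which is fine for Han characters (all at U+2E80 and above) but would not match four-digit keys for very low code points.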
Success! It works exactly as expected!