Background
While developing a simple divination tool, I ran into an interesting problem. If only a few specific Chinese characters are involved, we can hardcode a dictionary in the script, but what if we want the stroke count of any Chinese character?
pypinyin library
```python
from pypinyin import pinyin, Style

def get_strokes_count(chinese_character):
    # pinyin() returns one sub-list of readings per character,
    # so this counts pinyin results, not strokes
    pinyin_list = pinyin(chinese_character, style=Style.NORMAL)
    strokes_count = len(pinyin_list[0])
    return strokes_count

character = input("Please enter a Chinese character: ")
strokes = get_strokes_count(character)
print("Character '{}' stroke count: {}".format(character, strokes))
```
I gave it a try and found that the result is actually the number of pinyin readings returned for the character in the normal style, not its stroke count, so this approach does not work.
Unified Han Database
The Unihan database is a Chinese character database maintained by the Unicode Consortium. It is quite authoritative, and it also provides online tools.

You can query it online via the Unihan database lookup. The query results contain a kTotalStrokes field, which is exactly the stroke count data we need.

As Unicode's official database, the current version fully covers the basic needs of Chinese character lookup.
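For reference, a kTotalStrokes entry in the database is a tab-separated line of the following form (the character and value here are illustrative):

```
U+4E2D	kTotalStrokes	4
```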
Obtain stroke information from Unihan database
I originally planned to send queries directly through the online lookup tool, but it was too slow: the server is overseas and I was accessing it from China. Since the database file itself is not very large, I simply downloaded it.

After extracting the archive, there are several files.

From the lookup results we know the kTotalStrokes field comes from the IRG sources, so we extract the Unihan_IRGSources.txt file.

I tested a regex on Regular Expressions 101 to pull out the Unicode code point and the stroke count, which are then stored separately for lookup.
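As a quick sanity check of the pattern, here is a minimal sketch against a single sample line (the line is illustrative; real data comes from Unihan_IRGSources.txt, and the pattern is a slightly tightened variant of the one used below):

```python
import re

# Illustrative line in the tab-separated Unihan_IRGSources.txt format
sample = "U+4E2D\tkTotalStrokes\t4"

# Capture the code point and the stroke count
pattern = r"(U\+\S+)\skTotalStrokes\s(\d+)"
matches = re.findall(pattern, sample)
print(matches)  # [('U+4E2D', '4')]
```

With capture groups, `re.findall` returns a list of tuples, so non-matching lines simply yield an empty list and can be skipped.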
Coding
- Retrieve stroke information
```python
import json
import re
from pathlib import Path

file = Path("Stroke/Unihan_IRGSources.txt")
output = Path("Stroke/unicode2stroke.json")

stroke_dict = dict()
with open(file, mode="r", encoding="utf-8") as f:
    for line in f:
        raw_line = line.strip()
        pattern = r"(U\+.*)\skTotalStrokes.*\s(\d+)"
        result = re.findall(pattern=pattern, string=raw_line)
        if len(result) == 0:
            continue
        unicode_key = result[0][0]
        unicode_stroke = result[0][1]
        print(f"{unicode_key}: {unicode_stroke}")
        stroke_dict[unicode_key] = unicode_stroke

with open(file=output, mode="w", encoding="utf-8") as f:
    json.dump(stroke_dict, f, ensure_ascii=False, indent=4)
```
Export it to JSON for easy access later.
- Write the lookup function
```python
with open(output, encoding="utf-8") as f:
    unicode2stroke = json.load(f)

def get_character_stroke_count(char: str) -> int:
    # Convert the character to its "U+XXXX" code point, the key in our dict
    unicode_key = "U+" + hex(ord(char))[2:].upper()
    return int(unicode2stroke[unicode_key])

test_char = "阿"
get_character_stroke_count(char=test_char)
```
Note that for the lookup, the character must be converted to its hexadecimal code point in the U+XXXX form that Unihan uses as the key.
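The conversion step can be checked in isolation: `ord` gives the integer code point, and `hex` plus `.upper()` turns it into the Unihan-style key:

```python
char = "中"
# ord() gives the integer code point (0x4E2D for this character);
# hex() renders it as "0x4e2d", so strip the "0x" prefix and uppercase it
key = "U+" + hex(ord(char))[2:].upper()
print(key)  # U+4E2D
```

One caveat of this formatting: it produces no zero padding, which is fine for Han characters (all at U+2E80 and above) but would not match four-digit keys for very low code points.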
Success! It works exactly as expected!