Exploration and Research on OCR Technology for Standardized Yi Character

Junping Huang

doi:10.54097/babd6q58

Keywords:

Standardized Yi character; Optical Character Recognition; Deep learning; OpenCV; Tesseract-OCR.

Abstract

Optical character recognition (OCR) is the process of analyzing, understanding, and recognizing textual information in image files, and it is useful for extracting and collecting standardized Yi text from paper materials. Based on the coding standard and the font characteristics of standardized Yi characters, a Long Short-Term Memory (LSTM) neural network was trained with Tesseract-OCR to recognize standardized Yi characters, and the recognition results were validated. Python was used to call the OpenCV, Pytesseract, and PyQt5 libraries to build a standardized Yi character recognition system. Experiments show that the system recognizes Yi characters set in the Baiti, Songti, Heiti, and Xiheiti typefaces, runs offline, and achieves high recognition accuracy.
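The recognition step described in the abstract can be illustrated with a minimal Python sketch. It is not the author's implementation. It assumes that the coding standard in question maps to the Unicode Yi Syllables block (assigned code points U+A000 to U+A48C), that a trained LSTM model has been installed for Tesseract under the hypothetical name `yi_std`, and that Otsu binarization is an acceptable preprocessing step. The function names `keep_yi`, `char_accuracy`, and `recognize_yi` are likewise hypothetical.

```python
from difflib import SequenceMatcher

# Assigned code points of the Unicode "Yi Syllables" block (assumed coding standard).
YI_FIRST, YI_LAST = 0xA000, 0xA48C


def keep_yi(text):
    """Drop every character that is not a Unicode Yi syllable (whitespace included)."""
    return "".join(ch for ch in text if YI_FIRST <= ord(ch) <= YI_LAST)


def char_accuracy(reference, recognized):
    """Share of reference characters that difflib aligns with the OCR output."""
    if not reference:
        return 0.0
    matcher = SequenceMatcher(None, reference, recognized, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(reference)


def recognize_yi(image_path, lang="yi_std"):
    """Binarize a page image with OpenCV and run it through Tesseract.

    `lang` names a hypothetical traineddata file; replace it with the
    name of the model actually installed.
    """
    # Imported here so the helpers above work without OpenCV or Tesseract.
    import cv2
    import pytesseract

    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    if gray is None:
        raise FileNotFoundError(image_path)
    # Otsu thresholding picks the binarization level automatically.
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    # "--psm 6" asks Tesseract to treat the image as one uniform block of text.
    raw = pytesseract.image_to_string(binary, lang=lang, config="--psm 6")
    return keep_yi(raw)
```

Calling `recognize_yi("page.png")` on a scanned page and passing the result to `char_accuracy` together with a hand-checked transcription gives a simple way to compare recognition quality across the four typefaces. Running everything locally through Tesseract is consistent with the offline operation reported in the abstract.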
