Study of Tesseract OCR

Kartik Joshi

doi:10.69974/glskalp.01.02.54

Authors

Kartik Joshi, CEO, Techsamvaad Pvt. Ltd.


Keywords:

Artificial Intelligence, Optical Character Recognition, Tesseract

Abstract

In the current era of the Internet and digitization, a huge amount of information exists in printed forms such as books and newspapers. To preserve their contents, such documents are converted to a digital format by scanning them as images. Detecting text in the scanned images and correctly identifying the characters is a challenging problem. Tesseract is an open-source recognition engine that uses some novel techniques for optical character recognition (OCR). It has been designed to recognize more than 100 languages, including English, Italian, French, German, Spanish and Dutch, and it also works for several Indian languages such as Bengali, Gujarati, Hindi, Kannada, Malayalam and Oriya. OCR is the branch of image recognition used in applications to recognize text in scanned documents or images. Combined with the field of Artificial Intelligence, this technology is becoming a boon for capturing and comprehending data automatically. This paper presents a detailed study of the working of Tesseract OCR.

