Study of Tesseract OCR
DOI:
https://doi.org/10.69974/glskalp.01.02.54
Keywords:
Artificial Intelligence, Optical Character Recognition, Tesseract
Abstract
In the current era of the Internet and digitization, a huge amount of information exists in different forms such as books and newspapers. To preserve their contents, such documents are scanned and stored as digital images. Detecting the text in these scanned images and correctly identifying its characters is a challenging problem. Tesseract is an open-source recognition engine that uses some novel techniques for optical character recognition (OCR). It has been designed to recognize more than 100 languages, including English, Italian, French, German, Spanish and Dutch, as well as several Indian languages such as Bengali, Gujarati, Hindi, Kannada, Malayalam and Oriya. OCR is the branch of image recognition used to recognize text in scanned documents or images. Combined with Artificial Intelligence, this technology makes it possible to capture and comprehend data automatically. This paper presents a detailed study of the working of Tesseract OCR.
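As a small illustration of the multi-language support mentioned in the abstract, the sketch below builds a command line for the Tesseract command-line tool. The helper names (`LANG_CODES`, `build_tesseract_command`, `recognize`) and the file name `page.png` are hypothetical and do not come from the paper. The three-letter codes are those used by Tesseract's trained-data files, and actually running the recognition step assumes a local Tesseract installation with the corresponding language data.

```python
import shutil
import subprocess

# Three-letter codes used by Tesseract's trained-data files for the
# languages named in the abstract (this mapping is illustrative only).
LANG_CODES = {
    "English": "eng", "Italian": "ita", "French": "fra", "German": "deu",
    "Spanish": "spa", "Dutch": "nld", "Bengali": "ben", "Gujarati": "guj",
    "Hindi": "hin", "Kannada": "kan", "Malayalam": "mal", "Oriya": "ori",
}


def build_tesseract_command(image_path, languages):
    """Return the argument list for `tesseract <image> stdout -l <codes>`.

    Several languages are joined with '+', which asks Tesseract to use
    all of them when recognizing the image.
    """
    codes = "+".join(LANG_CODES[name] for name in languages)
    return ["tesseract", image_path, "stdout", "-l", codes]


def recognize(image_path, languages):
    """Run Tesseract if it is installed; return the text, else None."""
    if shutil.which("tesseract") is None:
        return None  # Tesseract is not available on this machine
    result = subprocess.run(
        build_tesseract_command(image_path, languages),
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

For example, `build_tesseract_command("page.png", ["English", "Hindi"])` yields the arguments for recognizing a scanned page that mixes English and Hindi text, and `recognize` passes them to the installed engine.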