Development of a Gold-standard Pashto Dataset and a Segmentation App
Affiliation: University of Arizona Libraries; Department of Mathematics, University of Arizona
Publisher: American Library Association
Citation: Han, Y., & Rychlik, M. (2021). Development of a Gold-standard Pashto Dataset and a Segmentation App. Information Technology and Libraries, 40(1).
Rights: Copyright (c) 2021 Yan Han, Marek Rychlik. This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
Collection Information: This item from the UA Faculty Publications collection is made available by the University of Arizona with support from the University of Arizona Libraries. If you have questions, please contact us at firstname.lastname@example.org.
Abstract: This article introduces a gold-standard Pashto dataset and a segmentation app. The dataset consists of 300 line images and the corresponding Pashto text from three selected books. A line image is an image containing a single line of text from a scanned page. To our knowledge, this is one of the first open access datasets that directly maps line images to their corresponding text in the Pashto language. We also describe the development of a segmentation app based on textbox-expanding algorithms, a different approach to OCR segmentation. We discuss the steps taken to build the dataset and to develop our segmentation approach. The article begins with the nature of the Pashto alphabet and its unique diacritics, which require special consideration during segmentation. We then review the need for datasets and the few Pashto datasets currently available. We discuss the criteria for selecting data sources; our language specialist selected three books from the Afghan Digital Repository. We review previous segmentation methods and introduce a new approach to segmentation for Pashto content. We present the segmentation app and its results to show readers how to adjust variables for different books. Our approach uses an expanding textbox method, which performs very well given the nature of Pashto script. The app can also be used for Persian and other languages that use the Arabic writing system. The dataset can be used for OCR training, OCR testing, and machine learning applications related to Pashto content. © 2021.
Note: Open access journal
Version: Final published version