Development of a gold-standard Pashto dataset and a segmentation app
Affiliation
University of Arizona Libraries
Department of Mathematics, University of Arizona
Issue Date
2021
Publisher
American Library Association
Citation
Han, Y., & Rychlik, M. (2021). Development of a Gold-standard Pashto Dataset and a Segmentation App. Information Technology and Libraries, 40(1).
Rights
Copyright (c) 2021 Yan Han, Marek Rychlik. This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
Collection Information
This item from the UA Faculty Publications collection is made available by the University of Arizona with support from the University of Arizona Libraries. If you have questions, please contact us at repository@u.library.arizona.edu.
Abstract
This article introduces a gold-standard Pashto dataset and a segmentation app. The dataset consists of 300 line images and their corresponding Pashto text from three selected books. A line image is an image of a single text line from a scanned page. To our knowledge, this is one of the first open access datasets that directly maps line images to their corresponding text in the Pashto language. We also describe the development of a segmentation app that uses textbox expanding algorithms, a different approach to OCR segmentation. We discuss the steps taken to build the dataset and to develop our segmentation approach. The article begins with the nature of the Pashto alphabet and its unique diacritics, which require special consideration during segmentation. We then review the need for datasets and the few Pashto datasets available. We discuss the criteria for selecting data sources; our language specialist selected three books from the Afghan Digital Repository. We review previous segmentation methods and introduce a new approach to segmentation for Pashto content. We present the segmentation app and its results to show readers how to adjust variables for different books. Our approach uses an expanding textbox method, which performs very well given the nature of the Pashto script. The app can also be used for Persian and other languages that use the Arabic writing system. The dataset can be used for OCR training, OCR testing, and machine learning applications involving Pashto content. © 2021.
Note
Open access journal
ISSN
0730-9295
Version
Final published version
DOI
10.6017/ITAL.V40I1.12553