dc.contributor.author: Han, Y.
dc.contributor.author: Rychlik, M.
dc.date.accessioned: 2021-06-17T01:09:43Z
dc.date.available: 2021-06-17T01:09:43Z
dc.date.issued: 2021
dc.identifier.citation: Han, Y., & Rychlik, M. (2021). Development of a Gold-standard Pashto Dataset and a Segmentation App. Information Technology and Libraries, 40(1).
dc.identifier.issn: 0730-9295
dc.identifier.doi: 10.6017/ITAL.V40I1.12553
dc.identifier.uri: http://hdl.handle.net/10150/659954
dc.description.abstract: The article aims to introduce a gold-standard Pashto dataset and a segmentation app. The Pashto dataset consists of 300 line images and corresponding Pashto text from three selected books. A line image is simply an image consisting of one text line from a scanned page. To our knowledge, this is one of the first open access datasets which directly maps line images to their corresponding text in the Pashto language. We also introduce the development of a segmentation app using textbox expanding algorithms, a different approach to OCR segmentation. The authors discuss the steps to build a Pashto dataset and develop our unique approach to segmentation. The article starts with the nature of the Pashto alphabet and its unique diacritics which require special considerations for segmentation. Needs for datasets and a few available Pashto datasets are reviewed. Criteria of selection of data sources are discussed and three books were selected by our language specialist from the Afghan Digital Repository. The authors review previous segmentation methods and introduce a new approach to segmentation for Pashto content. The segmentation app and results are discussed to show readers how to adjust variables for different books. Our unique segmentation approach uses an expanding textbox method which performs very well given the nature of the Pashto scripts. The app can also be used for Persian and other languages using the Arabic writing system. The dataset can be used for OCR training, OCR testing, and machine learning applications related to content in Pashto. © 2021.
dc.language.iso: en
dc.publisher: American Library Association
dc.rights: Copyright (c) 2021 Yan Han, Marek Rychlik. This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
dc.rights.uri: https://creativecommons.org/licenses/by-nc/4.0/
dc.title: Development of a gold-standard Pashto dataset and a segmentation app
dc.type: Article
dc.type: text
dc.contributor.department: University of Arizona Libraries
dc.contributor.department: Department of Mathematics, University of Arizona
dc.identifier.journal: Information Technology and Libraries
dc.description.note: Open access journal
dc.description.collectioninformation: This item from the UA Faculty Publications collection is made available by the University of Arizona with support from the University of Arizona Libraries. If you have questions, please contact us at repository@u.library.arizona.edu.
dc.eprint.version: Final published version
dc.source.journaltitle: Information Technology and Libraries
refterms.dateFOA: 2021-06-17T01:09:43Z


Files in this item

Name: 12553_ArticleText_27263_1_10_2 ...
Size: 1.109Mb
Format: PDF
Description: Final Published Version
