dc.contributor.author: Deng, Yuhao
dc.contributor.author: Chai, Chengliang
dc.contributor.author: Cao, Lei
dc.contributor.author: Tang, Nan
dc.contributor.author: Wang, Jiayi
dc.contributor.author: Fan, Ju
dc.contributor.author: Yuan, Ye
dc.contributor.author: Wang, Guoren
dc.date.accessioned: 2024-05-30T21:14:40Z
dc.date.available: 2024-05-30T21:14:40Z
dc.date.issued: 2024-05-03
dc.identifier.citation: Deng, Y., Chai, C., Cao, L., Tang, N., Wang, J., Fan, J., ... & Wang, G. (2024). MisDetect: Iterative Mislabel Detection using Early Loss. Proceedings of the VLDB Endowment, 17(6), 1159-1172. [en_US]
dc.identifier.issn: 2150-8097
dc.identifier.doi: 10.14778/3648160.3648161
dc.identifier.uri: http://hdl.handle.net/10150/672407
dc.description.abstract: Supervised machine learning (ML) models trained on data with mislabeled instances often produce inaccurate results due to label errors. Traditional methods of detecting mislabeled instances rely on data proximity, where an instance is considered mislabeled if its label is inconsistent with its neighbors. However, it often performs poorly, because an instance does not always share the same label with its neighbors. ML-based methods instead utilize trained models to differentiate between mislabeled and clean instances. However, these methods struggle to achieve high accuracy, since the models may have already overfitted mislabeled instances. In this paper, we propose a novel framework, MisDetect, that detects mislabeled instances during model training. MisDetect leverages the early loss observation to iteratively identify and remove mislabeled instances. In this process, influence-based verification is applied to enhance the detection accuracy. Moreover, MisDetect automatically determines when the early loss is no longer effective in detecting mislabels such that the iterative detection process should terminate. Finally, for the training instances that MisDetect is still not certain about whether they are mislabeled or not, MisDetect automatically produces some pseudo labels to learn a binary classification model and leverages the generalization ability of the machine learning model to determine their status. Our experiments on 15 datasets show that MisDetect outperforms 10 baseline methods, demonstrating its effectiveness in detecting mislabeled instances. [en_US]
dc.language.iso: en [en_US]
dc.publisher: Association for Computing Machinery (ACM) [en_US]
dc.rights: Copyright is held by the owner/author(s). Publication rights licensed to the VLDB Endowment. This work is licensed under the Creative Commons BY-NC-ND 4.0 International License. [en_US]
dc.rights.uri: https://creativecommons.org/licenses/by-nc-nd/4.0/ [en_US]
dc.title: MisDetect: Iterative Mislabel Detection using Early Loss [en_US]
dc.type: Proceedings [en_US]
dc.contributor.department: University of Arizona [en_US]
dc.identifier.journal: Proceedings of the VLDB Endowment [en_US]
dc.description.note: Open access article [en_US]
dc.description.collectioninformation: This item from the UA Faculty Publications collection is made available by the University of Arizona with support from the University of Arizona Libraries. If you have questions, please contact us at repository@u.library.arizona.edu. [en_US]
dc.eprint.version: Final published version [en_US]
dc.identifier.pii: 10.14778/3648160.3648161
dc.source.journaltitle: Proceedings of the VLDB Endowment
dc.source.volume: 17
dc.source.issue: 6
dc.source.beginpage: 1159
dc.source.endpage: 1172
refterms.dateFOA: 2024-05-30T21:14:41Z


Files in this item

Name: p1159-chai.pdf
Size: 1.964Mb
Format: PDF
Description: Final Published Version
