Hi Team,
I’m using the Document Classifier to distinguish between Medical Certificates and Statutory Documents. I’ve trained the model with 21 Medical Certs and 12 Statutory Docs. The incoming documents are predominantly (80%) Medical Certs, followed by Statutory Docs and other unrelated documents.
The goal is to classify incoming documents and extract data using AI only if the document is a Medical Certificate. However, I’ve observed that the model is classifying random, unrelated documents as Medical Certificates with high confidence (for example, one titled "use arrays functions exercise", an Appian training exercise, and another titled "blood report").
Has anyone encountered a similar issue? Any suggestions on improving classification accuracy and reducing false positives? The metrics aren't helping either; I haven't seen them change.
Is IDP classification not meant for unstructured files?
Hi ranjitap410023, cross-posting this response from a similar thread:
Thanks for reaching out about this! I would recommend the following:
1) Train the model with an equivalent number of documents for each document type. For example, I would add more Statutory Docs so that there are ~50 in the training set. This is best practice, even if you expect the distribution of documents in production to be mostly Medical Certs.
2) If you expect a variety of random documents that do not need to be handled in your downstream process, create a "Random/Invalid" document type. When providing training data for this document type, try to include as broad a set of documents as possible so that it is representative of the documents that will be ingested in production.
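To make point 1 concrete, here is a minimal Python sketch of a balance check you could run over your training-set labels before uploading them. Everything here is hypothetical (the function name, the 0.8 tolerance, and the label strings are my own, not part of any Appian API); it just illustrates how far the 21-vs-12 split is from balanced.

```python
from collections import Counter

def check_training_balance(labels, tolerance=0.8):
    """Flag document types that are underrepresented relative to the
    largest class. `tolerance` is the minimum allowed ratio of a class's
    count to the largest class's count (0.8 = at least 80% as many).
    Both the function and the tolerance value are illustrative."""
    counts = Counter(labels)
    largest = max(counts.values())
    # Return the classes that need more training documents.
    return {cls: count for cls, count in counts.items()
            if count / largest < tolerance}

# The training set described in the question: 21 vs. 12 documents.
labels = ["Medical Certificate"] * 21 + ["Statutory Doc"] * 12
print(check_training_balance(labels))  # {'Statutory Doc': 12}
```

With 12/21 ≈ 0.57, the Statutory Docs class is flagged, which matches the advice above to grow it to roughly parity before retraining.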
Please let me know if you have any questions!
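One more sketch, since the stated goal is to run AI extraction only for Medical Certificates: regardless of how the classifier is improved, you can gate the downstream extraction step on both the predicted type and its confidence. This is a generic pattern, not an Appian API; the function name and the 0.9 threshold are assumptions you would tune against your own data.

```python
def should_extract(doc_type, confidence, threshold=0.9):
    """Gate the downstream AI extraction step: run it only when the
    classifier is confident the document is a Medical Certificate.
    The 0.9 threshold is an assumed starting point, not a recommendation
    from the product documentation."""
    return doc_type == "Medical Certificate" and confidence >= threshold

# Hypothetical classifier outputs:
print(should_extract("Medical Certificate", 0.95))  # True
print(should_extract("Medical Certificate", 0.62))  # False: low confidence
print(should_extract("Statutory Doc", 0.97))        # False: wrong type
```

A gate like this won't fix the false positives themselves, but it limits how many unrelated documents reach the expensive extraction step while you retrain with a balanced set and a "Random/Invalid" type.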
Thank you! I started doing that, and I can see the metrics changing now!