Hi Team,
I’m using the Document Classifier to distinguish between Medical Certificates and Statutory Documents. I’ve trained the model with 47 Medical Certs and 11 Statutory Docs. The incoming documents are predominantly (around 80%) Medical Certs, followed by Statutory Docs and other unrelated documents.
The goal is to classify incoming documents and extract data using AI only if the document is a Medical Certificate. However, I’ve observed that the model classifies random, unrelated documents as Medical Certificates with high confidence (note, for example, the "use arrays functions exercise" and "blood report" documents).
Has anyone encountered a similar issue? Any suggestions on improving classification accuracy and reducing false positives? The metrics are not helping either; they are all reported as 1.000.
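For clarity, this is roughly the flow I’m aiming for (a minimal sketch in Python; the function names and the confidence threshold are placeholders I made up, not the actual IDP APIs):

```python
# Minimal sketch of the gating I have in mind; classify_document and
# extract_fields are placeholders, not the real classifier/extractor calls.

CONFIDENCE_THRESHOLD = 0.85  # assumed cut-off, would need tuning on real data


def classify_document(doc):
    """Placeholder classifier: returns (predicted label, confidence)."""
    return "Medical Certificate", 0.97


def extract_fields(doc):
    """Placeholder AI extraction step, run only for Medical Certs."""
    return {"patient_name": "...", "issue_date": "..."}


def process_document(doc):
    label, confidence = classify_document(doc)
    # Extract only when the document is a Medical Certificate AND the
    # classifier is confident enough; everything else is skipped.
    if label == "Medical Certificate" and confidence >= CONFIDENCE_THRESHOLD:
        return extract_fields(doc)
    return None


print(process_document("dummy document text"))
```

The problem is that the unrelated documents are sailing past any sensible threshold because the classifier reports them as Medical Certificates with high confidence.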
Hi Team - Any help here? Is IDP classification not meant for unstructured data files like medical certificates, death certificates, etc.?
Hi ranjitap410023, thanks for reaching out about this! I would recommend the following:

1) Train the model with an equivalent number of documents for each document type. For example, I would add more Statutory Docs so that there are ~50 in the training set. This is best practice, even if you expect the distribution of documents in production to be mostly Medical Certs.

2) If you expect a variety of random documents that do not need to be handled in your downstream process, create a "Random/Invalid" document type. When providing training data for this document type, try to include as broad a set of documents as possible so that it is representative of the documents that will be ingested in production.
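Also, to make the metrics meaningful, score the model on a held-out set of documents it never saw during training and look at a per-class confusion matrix; metrics of 1.000 reported on the training data will hide exactly the false positives you are seeing. A rough illustration of what that evaluation looks like (the labels below are invented just to show the pattern, and scikit-learn is used purely as a stand-in, not the product's API):

```python
# Sketch only: evaluating classification on a held-out set with a
# per-class confusion matrix, including the suggested "Random/Invalid" class.
from sklearn.metrics import classification_report, confusion_matrix

LABELS = ["Medical Certificate", "Statutory Doc", "Random/Invalid"]

# True labels of a small held-out set (documents not used for training)...
y_true = [
    "Medical Certificate", "Medical Certificate", "Statutory Doc",
    "Random/Invalid", "Random/Invalid",  # e.g. the "arrays exercise" and "blood report"
]
# ...and what the classifier predicted for them.
y_pred = [
    "Medical Certificate", "Medical Certificate", "Statutory Doc",
    "Medical Certificate", "Medical Certificate",  # the false positives show up here
]

# The confusion matrix makes the misclassified "Random/Invalid" docs visible,
# and the report gives per-class precision/recall instead of a flat 1.000.
print(confusion_matrix(y_true, y_pred, labels=LABELS))
print(classification_report(y_true, y_pred, labels=LABELS, zero_division=0))
```

Once the "Random/Invalid" type is trained with a broad sample of unrelated documents, precision on the Medical Certificate class in this kind of report is the number to watch.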
Please let me know if you have any questions!