Hi Team,
I’m using the Document Classifier to distinguish between Medical Certificates and Statutory Documents. I’ve trained the model with 47 Medical Certs and 11 Statutory Docs. The incoming documents are predominantly (around 80%) Medical Certs, followed by Statutory Docs and other unrelated documents.
The goal is to classify incoming documents and extract data using AI only if the document is a Medical Certificate. However, I’ve observed that the model classifies random, unrelated documents as Medical Certificates with high confidence (note, for example, the "use arrays functions exercise" and "blood report" documents).
Has anyone encountered a similar issue? Any suggestions on improving classification accuracy and reducing false positives? The metrics are not helping either; they are all reported as 1.000.
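For clarity, this is roughly the flow I’m aiming for (a minimal sketch in Python; the function names and the confidence threshold are placeholders I made up, not the actual IDP APIs):

```python
# Minimal sketch of the gating I have in mind; classify_document and
# extract_fields are placeholders, not the real classifier/extractor calls.

CONFIDENCE_THRESHOLD = 0.85  # assumed cut-off, would need tuning on real data


def classify_document(doc):
    """Placeholder classifier: returns (predicted label, confidence)."""
    return "Medical Certificate", 0.97


def extract_fields(doc):
    """Placeholder AI extraction step, run only for Medical Certs."""
    return {"patient_name": "...", "issue_date": "..."}


def process_document(doc):
    label, confidence = classify_document(doc)
    # Extract only when the document is a Medical Certificate AND the
    # classifier is confident enough; everything else is skipped.
    if label == "Medical Certificate" and confidence >= CONFIDENCE_THRESHOLD:
        return extract_fields(doc)
    return None


print(process_document("dummy document text"))
```

The problem is that the unrelated documents are sailing past any sensible threshold because the classifier reports them as Medical Certificates with high confidence.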
Hi Team - Any help here? Is IDP classification not meant for unstructured data files like medical certificates, death certificates, etc.?
Hi ranjitap410023, thanks for reaching out about this! I would recommend the following:

1) Train the model with an equivalent number of documents for each document type. For example, I would add more Statutory Docs so that there are ~50 in the training set. This is best practice, even if you expect the distribution of documents in production to be mostly Medical Certs.

2) If you expect a variety of random documents that do not need to be handled in your downstream process, create a "Random/Invalid" document type. When providing training data for this document type, try to include as broad a set of documents as possible so that it is representative of the documents that will be ingested in production.
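Also, to make the metrics meaningful, score the model on a held-out set of documents it never saw during training and look at a per-class confusion matrix; metrics of 1.000 reported on the training data will hide exactly the false positives you are seeing. A rough illustration of what that evaluation looks like (the labels below are invented just to show the pattern, and scikit-learn is used purely as a stand-in, not the product's API):

```python
# Sketch only: evaluating classification on a held-out set with a
# per-class confusion matrix, including the suggested "Random/Invalid" class.
from sklearn.metrics import classification_report, confusion_matrix

LABELS = ["Medical Certificate", "Statutory Doc", "Random/Invalid"]

# True labels of a small held-out set (documents not used for training)...
y_true = [
    "Medical Certificate", "Medical Certificate", "Statutory Doc",
    "Random/Invalid", "Random/Invalid",  # e.g. the "arrays exercise" and "blood report"
]
# ...and what the classifier predicted for them.
y_pred = [
    "Medical Certificate", "Medical Certificate", "Statutory Doc",
    "Medical Certificate", "Medical Certificate",  # the false positives show up here
]

# The confusion matrix makes the misclassified "Random/Invalid" docs visible,
# and the report gives per-class precision/recall instead of a flat 1.000.
print(confusion_matrix(y_true, y_pred, labels=LABELS))
print(classification_report(y_true, y_pred, labels=LABELS, zero_division=0))
```

Once the "Random/Invalid" type is trained with a broad sample of unrelated documents, precision on the Medical Certificate class in this kind of report is the number to watch.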
Please let me know if you have any questions!