Document Classifier failing to classify unstructured docs

Certified Associate Developer

Hi Team,

I’m using the Document Classifier to distinguish between Medical Certificates and Statutory Documents. I’ve trained the model with 21 Medical Certs and 12 Statutory Docs. About 80% of the incoming documents are Medical Certs; the rest are Statutory Docs and other unrelated documents.

The goal is to classify incoming documents and extract data using AI only if the document is a Medical Certificate. However, I’ve observed that the model is classifying random, unrelated documents as Medical Certificates with high confidence (for example, "use arrays functions exercise", an Appian training exercise, and "blood report").
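
The routing rule described above can be sketched as follows. This is only an illustration in Python, not Appian code: the function name `should_extract`, the label strings, and the `min_confidence` value of 0.9 are all assumptions. A threshold like this would not, by itself, stop the confident false positives described here, since those documents already score high.

```python
MEDICAL_CERT = "Medical Certificate"  # assumed label name, for illustration only


def should_extract(predicted_type: str, confidence: float,
                   min_confidence: float = 0.9) -> bool:
    """Return True only when the document should go on to AI extraction.

    min_confidence is a hypothetical cut-off, not a documented default.
    """
    return predicted_type == MEDICAL_CERT and confidence >= min_confidence


# Hypothetical classifier outputs: (predicted type, confidence)
results = [
    ("Medical Certificate", 0.97),
    ("Statutory Document", 0.95),
    ("Medical Certificate", 0.55),
]
decisions = [should_extract(doc_type, conf) for doc_type, conf in results]
```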

Has anyone encountered a similar issue? Any suggestions for improving classification accuracy and reducing false positives? The metrics are not helping either; I haven't seen them change.

Is IDP classification not meant for unstructured files?

  • Hi, cross-posting this response from a similar thread:

    Thanks for reaching out about this! I would recommend the following:
    1) Train the model with an equivalent number of documents for each document type. For example, I would add more Statutory Docs so that there are ~50 in the training set. This is best practice, even if you expect the distribution of documents in production to be mostly Medical Certs.
    2) If you expect a variety of random documents that do not need to be handled in your downstream process, create a "Random/Invalid" document type. When providing training data for this document type, try to include as broad a set of documents as possible so that it is representative of the documents that will be ingested in production.
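
As a rough illustration of the two points above, here is a minimal Python sketch that counts how many more training documents each type would need to reach a common target. The function name `balance_report` is hypothetical, the target of 50 per type simply reuses the ~50 figure from point 1, and the 21 and 12 counts come from the original post. None of this is Appian functionality; it is only a way to reason about the training set before uploading it.

```python
from collections import Counter


def balance_report(labels, doc_types, target_per_type):
    """Return how many more samples each document type needs to reach the target."""
    counts = Counter(labels)  # missing types count as 0
    return {t: max(0, target_per_type - counts[t]) for t in doc_types}


# Current training set as described in the original post
training_labels = ["Medical Certificate"] * 21 + ["Statutory Document"] * 12

# Includes the catch-all type from point 2, which has no samples yet
doc_types = ["Medical Certificate", "Statutory Document", "Random/Invalid"]

shortfall = balance_report(training_labels, doc_types, target_per_type=50)
for doc_type, missing in shortfall.items():
    print(f"{doc_type}: add {missing} more")
```

With these numbers, the catch-all type needs the most new samples, which matches the advice to collect as broad a set of unrelated documents as possible.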

    Please let me know if you have any questions!

  • Certified Associate Developer
    in reply to Louis Prensky

    Thank you! I started doing that, and I can see the metrics changing now!
