AI Skill - Document Extraction

Certified Senior Developer

Hi Appian Community, hope you are doing well.
I have the following use case: the solution should capture payslip documents from the end user and extract the data from them. As of the Appian 23.4 release, the AI skill has become much smarter and works like a charm for documents that share a similar structure. In this use case, however, each user can upload a payslip in a different format. All the payslips will be structured PDF documents, but the layout can vary from one to the next. What is the best way to train the model so it can capture data from these payslips? And if that is not possible, can you please suggest a workaround?

Regards


  • Hi, in 23.4 you should be able to train a model that can extract data from documents with varying structures. To get good results, it is important to provide a dataset that is representative of the formats you expect to see in production. That said, the model does not necessarily need to have been trained on a given document template to extract data from it; if it has seen a wide variety of examples during training, it should be able to extrapolate to new ones. Here is our documentation on building a representative dataset. I have also reached out to you directly to discuss your use case in more detail. However,  is correct that you may get more consistent results by training models on specific templates if there is a subset of templates that accounts for a large portion of your overall volume.
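
To make the last point concrete, here is a minimal sketch of how you might check whether a small subset of templates accounts for most of your volume. It is plain Python run outside Appian, not an Appian API. It assumes you have already tagged a sample of payslips with a template label by hand; the function name, the labels, and the 80% threshold are all hypothetical choices for illustration.

```python
from collections import Counter


def split_templates_by_volume(template_labels, coverage=0.8):
    """Split templates into 'dedicated model' and 'general model' groups.

    template_labels: one hand-assigned label per sampled payslip (hypothetical).
    coverage: share of total volume the dedicated group should reach (assumed).
    """
    counts = Counter(template_labels)
    total = sum(counts.values())
    dedicated = []
    covered = 0
    # Walk templates from most to least frequent until the target share is met.
    for template, n in counts.most_common():
        if total == 0 or covered / total >= coverage:
            break
        dedicated.append(template)
        covered += n
    # Everything else would go to one model trained on a varied dataset.
    general = [t for t in counts if t not in dedicated]
    return dedicated, general


# Example with made-up labels: two templates cover 80% of the sample.
sample = ["A"] * 50 + ["B"] * 30 + ["C"] * 10 + ["D"] * 10
dedicated, general = split_templates_by_volume(sample)
print(dedicated, general)
```

If the dedicated list is short, training one model per template in it and a single general model for the rest may be worth trying. If volume is spread evenly across many templates, a single model trained on a varied, representative dataset is probably the simpler starting point.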
