RE: AI Skill - Document Extraction

Louis Prensky — Tue, 28 Nov 2023 23:03:27 GMT

Hi, in 23.4 you should be able to train a model that can extract data from documents with varying structures. To get good results, it is important to provide a dataset that is representative of the formats you expect to see in production. The model does not necessarily need to have been trained on a given document template to extract data from it; if it has seen a wide variety of examples during training, it should be able to generalize to new ones. Here is our documentation on building a representative dataset. I have also reached out to you directly to discuss your use case in more detail. That said, it is correct that you may get more consistent results by training models on specific templates if a subset of templates accounts for a large portion of your overall volume.
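
One way to check whether a few templates really do account for most of the volume is a quick tally over a labeled sample of production documents. The snippet below is only a rough sketch in plain Python, not part of the product: the function name, the template labels, and the 20% threshold are all made up for illustration.

```python
from collections import Counter


def dominant_templates(template_labels, min_share=0.2):
    """Return {template: share} for templates at or above min_share of volume.

    template_labels: one (hypothetical) template label per sampled document.
    min_share: illustrative cut-off; pick a value that suits your volume.
    """
    counts = Counter(template_labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    # most_common() keeps the result ordered from largest share to smallest.
    return {
        template: count / total
        for template, count in counts.most_common()
        if count / total >= min_share
    }


# Made-up sample: 10 documents spread over three template labels.
sample = ["invoice_a"] * 6 + ["invoice_b"] * 3 + ["other"] * 1
print(dominant_templates(sample))
```

Templates that come back from a check like this are the candidates for a dedicated model; everything else can stay with the general model trained on the broad dataset.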

RE: AI Skill - Document Extraction

Stefan Helzle — Tue, 28 Nov 2023 09:21:49 GMT

Documents with large deviations in their structure are always a challenge. I assume that not every document is different. Can you group the documents into classes and then use a classification model to identify the variants? Then build an extraction model for each class. That might cover at least a large part of the documents.
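
The classify-then-extract idea could be sketched roughly like this. It is plain Python for illustration only and does not use any Appian API: `route_document`, the toy classifier, and the toy extractors are hypothetical stand-ins for a trained classification model and per-class extraction models.

```python
from typing import Callable, Dict

# An extractor takes document text and returns the extracted fields.
Extractor = Callable[[str], dict]


def route_document(
    text: str,
    classify: Callable[[str], str],
    extractors: Dict[str, Extractor],
    fallback: Extractor,
) -> dict:
    """Classify a document, then run the extractor registered for its class.

    Documents whose class has no dedicated extractor go to the fallback.
    """
    label = classify(text)
    extractor = extractors.get(label, fallback)
    return {"class": label, "fields": extractor(text)}


# --- Toy stand-ins (assumptions, not real models) ---
def toy_classify(text: str) -> str:
    return "invoice" if "invoice" in text.lower() else "unknown"


def extract_invoice(text: str) -> dict:
    # Picks up the first line that starts with "Total:".
    for line in text.splitlines():
        if line.lower().startswith("total:"):
            return {"total": line.split(":", 1)[1].strip()}
    return {}


def generic_extract(text: str) -> dict:
    # Placeholder for a general model trained on a broad dataset.
    return {}


result = route_document(
    "Invoice 42\nTotal: 99.50",
    toy_classify,
    {"invoice": extract_invoice},
    generic_extract,
)
print(result)
```

Keeping a fallback extractor means documents that do not match any known class are still processed, just with less tailored results.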