AI Skill Document Extraction - Couldn't extract all the data in PDF

Certified Senior Developer

Hello,

We are trying to extract data from a PDF using the AI Skill document extraction, but most of the data is not being extracted. Is there any way to improve the extraction so that it captures all of the data in the PDF?

Thanks

R Gopala Krishna Raju

  • 0
    Certified Lead Developer

    Yes. The Reconciliation smart service will train the model.

    For more specific advice, we will need far more details.

  • 0
    Certified Senior Developer
    in reply to Stefan Helzle

    We tried using the Reconcile smart service after the AI Skill extraction, but we saw no difference in the extracted data, even after many attempts.

  • 0
    Certified Lead Developer
    in reply to gopalakrishnarajur8985

    Hm ... what kind of documents do you extract from? With the documents I use, it took just a few cycles to get very good results.

  • 0
    Certified Senior Developer
    in reply to Stefan Helzle

    It was an account creation form.

  • Hi Gopala,

    Below, I have included an explanation of how the underlying functionality in the Document Extraction Skill works. The short explanation is that Extraction uses a pre-trained model that Appian Engineering developed on a dataset of form-style documents to extract key-value pairs, tables, and checkboxes. As you reconcile documents, Appian learns how the keys in your documents correspond to the fields you defined in your AI Skill data structure; however, reconciliation does not actually retrain the underlying ML model, which means some fields may continue to require manual extraction. 

    The text below has more details. We are working on adding this information to our product docs so that it is more readily available to our users, so it would be great to hear any questions or feedback you have after reading this.

    ---

    More detailed explanation of AI Skill Extraction functionality in 23.2:

    First, it's important to remember that Appian document extraction is powered by pre-trained machine learning models. When you extract document data in Appian, you aren't creating a model or training one on data you provide. Instead, Appian learns about your data via the Reconciliation task.

    The document extraction process -- either within the Extract from Document or Start Doc Extraction smart services -- comprises two parts:

    1. Extract data from a PDF using a pre-trained ML model.
    2. Map the extracted data to the customer's Appian data structure for reconciliation.

    Step 1: Data extraction

    **Input**: PDF
    **Output**: Identified text, key-value pairs, checkboxes, and tables

    In the first step, the PDF goes to a pre-trained machine learning service to run optical character recognition (OCR), extract key-value pairs, and identify checkboxes and tables. The service returns all identified values. These are represented by blue bounding boxes in the Reconcile task.

    Step 2: Data mapping via reconciliation

    **Input**: Identified text, key-value pairs, checkboxes, and tables from step 1
    **Output**: Auto-extracted fields to Appian process model for use in your application

    The second step leverages previous mappings stored in the customer's environment to know which extracted data to map to the document structure. This mapping is stored in a dictionary as you complete reconciliation tasks over time.

    If your Appian environment has previously mapped values to your structured fields, Appian will leverage the previous keys to assist in mapping. A user will complete a reconciliation task to confirm that those mappings are correct. When a user maps data to a field in the reconciliation task, Appian stores the label for the key that was mapped. For example, if you provide mappings, Appian will recognize that `P.O. #` and `PO No.` both map to the `poNumber` Appian data type field.

    Once the user submits the reconciliation task, Appian stores updated mappings in a simple dictionary of terms (keys and positions) to use next time it has to map data from the pre-trained model (output of step 1) to the structured fields in your application. Reconciliation helps Appian manage variations in semi-structured and structured forms. In this way, reconciliation helps document extraction learn more about your data.
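To make the dictionary idea concrete, here is a minimal Python sketch of a label-to-field dictionary that grows as reconciliation tasks are submitted. It is only an illustration of the concept described above, not Appian's implementation: the `LabelDictionary` class, its `learn` and `auto_map` methods, and the `normalize` helper are all hypothetical names, and the sketch ignores the position information that the real dictionary also stores.

```python
import re


def normalize(label: str) -> str:
    """Lowercase a label and drop punctuation and spaces (hypothetical helper)."""
    return re.sub(r"[^a-z0-9]", "", label.lower())


class LabelDictionary:
    """Hypothetical stand-in for the per-environment dictionary of mapped keys."""

    def __init__(self):
        self._label_to_field = {}

    def learn(self, label: str, field: str) -> None:
        # Called when a user confirms a mapping in a reconciliation task.
        self._label_to_field[normalize(label)] = field

    def auto_map(self, extracted_pairs):
        # extracted_pairs: (label, value) tuples returned by the pre-trained model.
        mapped, unmapped = {}, []
        for label, value in extracted_pairs:
            field = self._label_to_field.get(normalize(label))
            if field is None:
                unmapped.append((label, value))  # left for the user to reconcile
            else:
                mapped[field] = value
        return mapped, unmapped


dictionary = LabelDictionary()
dictionary.learn("P.O. #", "poNumber")   # learned from one reconciled document
dictionary.learn("PO No.", "poNumber")   # learned from a differently worded one

mapped, unmapped = dictionary.auto_map([("PO No.", "12345"), ("Ship To", "Berlin")])
```

In this sketch, `PO No.` is mapped automatically because a user reconciled that label before, while `Ship To` stays unmapped until someone maps it in a reconciliation task.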

    The model in step 1 does not get retrained. If the ML service misses a field, Appian will continue to miss that field. This means that there are forms where our auto-extraction will not extract information desired by customers. In these situations, customers can leverage manual extraction to get the last pieces of information.
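The consequence of this can be sketched in a few lines of Python. The function name, the field names, and the data shapes below are hypothetical and chosen only for illustration; the point is that a label already stored in the dictionary does not help if the pre-trained model never detects it on the page.

```python
def fields_needing_manual_extraction(required_fields, detected_labels, label_to_field):
    """Return the required fields that step 1 plus the dictionary cannot fill.

    required_fields: field names in the (hypothetical) Appian data structure
    detected_labels: labels the pre-trained model found in this document
    label_to_field:  mappings learned from earlier reconciliation tasks
    """
    auto_filled = {
        label_to_field[label]
        for label in detected_labels
        if label in label_to_field
    }
    return [field for field in required_fields if field not in auto_filled]


# The dictionary knows both labels, but the model only detected one of them.
learned = {"PO No.": "poNumber", "Account Holder": "accountHolder"}
missing = fields_needing_manual_extraction(
    required_fields=["poNumber", "accountHolder"],
    detected_labels=["PO No."],
    label_to_field=learned,
)
```

Here `accountHolder` still needs manual extraction, no matter how many reconciliation cycles have run, because the model in step 1 did not return that key.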

  • 0
    Certified Senior Developer
    in reply to Louis Prensky

    Hi Louis,

    Step 1: Data extraction

    In this step, the service returns all identified values. These are represented by blue bounding boxes in the Reconcile task, but they are not mapped to the Appian data type fields.

    The checkboxes below are identified values, but they are still not mapped to Appian data type fields, even after many cycles of reconciliation.

    Can you please let us know whether there is any way to map identified values to Appian data type fields automatically during data extraction?

    Step 2: Data mapping via reconciliation

    In the reconciliation task, we tried to map the table to the Appian data type, but even the table headers are not captured after selecting the table manually.

    When we select a table, the headers are not recognized properly, so we map the table headers to Appian data type fields manually. The same happens with the table data, even after many cycles of reconciliation.

    Below are some of the tables used in our document, for which we are not able to select the headers manually. Can you please let us know whether there is a way of selecting a table so that its headers and data are recognized?

  • 0
    Certified Associate Developer
    in reply to gopalakrishnarajur8985

    Hi Gopala,

    Yes, document extraction does not extract all of the content in the PDF file.

    I tried different files and found that if the file's layout is clear enough for the machine to interpret, it extracts almost all of the data. Sometimes, however, it only highlights the content in the PDF, and we have to click the highlighted data to assign it to the fields.

    Sometimes, possibly because the PDF is hard to interpret, it does not extract the data at all. In that case we enter the data manually; there is no other way.

    This is what I found while working with document extraction.