Appian Community
Adobe PDF Document Content Searching
brianm
over 12 years ago
Adobe PDF Document Content Searching;
Does anyone have any experience searching content within a PDF document? There has been a recent request to provide this capability, and I wanted to see whether there is a smart service for it, or whether anyone has used a different approach to search the content of a specific PDF document within Appian.
Thank you in advance for your help!
Brian...
OriginalPostID-62859
larson.thune
Appian Employee
over 12 years ago
Brian,
There is a smart service plugin called "Parse PDF Plugin" that may be able to accomplish this task. I have not worked personally with this plugin, but I believe the PDF must have form elements in order for the values to be parsed (as opposed to a scanned PDF, say). Hope this helps!
brianm
over 12 years ago
Larson;
Thank you for the feedback. I was looking at this smart service, and it appears it would be helpful, as some PDFs may be based on field entry / form fill. I haven't installed it, but the problem may be that we will also have documents printed to .pdf, and we are looking for a method to search .pdf content based on keywords entered into a search option.
The business case: there are multiple documents that will be written to a SharePoint repository. Similar to a knowledge repository concept, we are looking to scan / search PDFs and bring back documents that fit the criteria for review / additional metadata entry. As these documents are reviewed and worked, they will follow the applicable business process, i.e. approval flows, etc.
Let me know if this makes sense and/or if you think it is still possible with the Parse PDF plugin.
Thanks for your input
Brian
Sathya Srinivasan
Appian Employee
over 12 years ago
You can very easily build a smart service using an extraction library such as Apache POI, which has custom extractors for various document types. I don't have the exact code for PDF, but I wrote extenders for Office Open XML documents (i.e. docx, pptx, xlsx, etc.), and here is the code for extracting content from a docx file. You would use the same approach for PDF. The only item you need to change is the document class (from XWPFDocument to the one corresponding to PDF). Note that POI itself covers the Microsoft Office formats, so the PDF document class has to come from a PDF library.
package com.appiancorp.search.extend.extract;

/**
 * @author sathya.srinivasan
 * @description - MS DOCX document extractor
 * @date 24/11/2011
 */
import java.io.InputStream;

import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;

import com.appiancorp.search.extract.text.ITextExtractor;
import com.appiancorp.search.extract.text.TextExtractorException;

public class DocxTextExtractor implements ITextExtractor {

    public DocxTextExtractor() {
    }

    @Override
    public String extract(InputStream in) throws TextExtractorException {
        try {
            // Parse the .docx stream, then pull out its plain text.
            XWPFDocument document = new XWPFDocument(in);
            XWPFWordExtractor wordExtractor = new XWPFWordExtractor(document);
            return wordExtractor.getText();
        } catch (Exception e) {
            throw new TextExtractorException("Exception: " + e.getMessage());
        }
    }
}
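The "same method, different document class" idea can be sketched as a small registry that picks an extractor by file extension. This is only an illustration under stated assumptions: `ExtractorRegistry` and its nested `TextExtractor` interface are hypothetical stand-ins (not Appian's `ITextExtractor`), and the extractor registered in `main` is a plain-text placeholder where a real DOCX or PDF extractor would be plugged in.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

public class ExtractorRegistry {

    /** Hypothetical stand-in for an extractor interface such as ITextExtractor. */
    public interface TextExtractor {
        String extract(InputStream in) throws IOException;
    }

    // Extractors keyed by lower-case file extension, e.g. "docx" or "pdf".
    private final Map<String, TextExtractor> byExtension = new HashMap<>();

    public void register(String extension, TextExtractor extractor) {
        byExtension.put(extension.toLowerCase(Locale.ROOT), extractor);
    }

    /** Looks up the extractor for the file's extension and runs it. */
    public String extract(String fileName, InputStream in) {
        int dot = fileName.lastIndexOf('.');
        String ext = dot < 0 ? "" : fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        TextExtractor extractor = byExtension.get(ext);
        if (extractor == null) {
            throw new IllegalArgumentException("No extractor registered for: " + fileName);
        }
        try {
            return extractor.extract(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        ExtractorRegistry registry = new ExtractorRegistry();
        // Placeholder extractor: treats the stream as UTF-8 text.
        registry.register("txt", in -> new String(in.readAllBytes(), StandardCharsets.UTF_8));
        InputStream sample = new ByteArrayInputStream("sample text".getBytes(StandardCharsets.UTF_8));
        System.out.println(registry.extract("notes.txt", sample));
    }
}
```

With this shape, adding PDF support means registering one more extractor under "pdf" rather than changing the calling code.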
elizabeth.epstein
over 12 years ago
If your PDFs are searchable/indexable by Appian (forum.appian.com/.../Search, forum.appian.com/.../Configuring_Search), then you can use Content Search Services (Shared Components --> Smart Services) to search for a keyword and return results within a process. The discussion thread associated with that component has some more information about query possibilities.
I know that PDFs are theoretically indexable by Lucene, but there have been issues in the past. If you follow the search configuration instructions and still can't search for text within your PDF (an unlocked PDF that is searchable by, say, Google or Windows), then post back.
brianm
over 12 years ago
The Content Search Service appears to search only within Appian Document Management, so I am trying to think of possible options.
In our case we are most likely going to use SharePoint for document storage, as it is bundled very well with Project Server, and both are used heavily in some business areas. Using SharePoint also ensures that 'other' content is centralized alongside the defined 'searchable' content.
In reading some of the posts, it appears there are a couple of options (I wanted to confirm/verify):
1. We could build a smart service that uses an extraction library directly (per Sathya's comments / similar to the docx example). This would enable us to search SharePoint repository content directly from Appian.
2. We could use the content search capability, but it appears that the search only works with the Appian document repository. We would then need to think of a way to sync this searchable content between the Appian and SharePoint repositories. Syncing would make the content available in both locations, but would double the storage footprint.
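For option 1, once the text has been extracted from each document, the keyword matching itself is simple. A minimal sketch, assuming the extracted text is already held in a map of document name to text; `KeywordSearch` and `matchAll` are hypothetical names, and a real implementation over many documents would more likely rely on an index than on a linear scan like this.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class KeywordSearch {

    /**
     * Returns the names of documents whose extracted text contains every
     * keyword, ignoring case. An empty keyword list matches all documents.
     */
    public static List<String> matchAll(Map<String, String> textByDocument, List<String> keywords) {
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, String> entry : textByDocument.entrySet()) {
            String haystack = entry.getValue().toLowerCase(Locale.ROOT);
            boolean containsAll = true;
            for (String keyword : keywords) {
                if (!haystack.contains(keyword.toLowerCase(Locale.ROOT))) {
                    containsAll = false;
                    break;
                }
            }
            if (containsAll) {
                hits.add(entry.getKey());
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Illustrative data only; the file names and text are made up.
        Map<String, String> texts = new LinkedHashMap<>();
        texts.put("policy.pdf", "Travel approval flow and expense policy");
        texts.put("minutes.pdf", "Meeting minutes for the project board");
        System.out.println(matchAll(texts, List.of("approval", "policy")));
    }
}
```

The documents returned by a search like this could then be routed into the review and metadata-entry process described earlier in the thread.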
Please advise if I missed anything
Thank you all for your help and feedback on this one
Brian
venkateshr
over 12 years ago
Brian, in line with what Sathya has mentioned about the Apache POI capabilities, there are specific libraries that support PDF data extraction. You can take a look at PDFBox (an Apache project specifically for working with PDF documents, including text extraction).
shelzle
over 12 years ago
We also do a lot of interaction between Appian and SharePoint. Most of the time I read data from SharePoint lists. I think SP is pretty good at indexing documents, so why not ask SP for the matching documents? I found this:
sharepoint.stackexchange.com/.../query-the-sharepoint-search-index-externally
These webservices may not be callable with the OOTB webservice node but you can build a plugin or call the WS "by hand" using the HTML request node.
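As a rough illustration of calling the search "by hand", the sketch below only builds the request URL. It assumes a SharePoint version that exposes the search REST endpoint `_api/search/query` (SharePoint 2013 or later), which is not necessarily the web service discussed in the link above. The site URL and the `SharePointSearchUrl` class are hypothetical, and authentication and response parsing are left out.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class SharePointSearchUrl {

    /**
     * Builds a search query URL for the given site and keywords.
     * The query text is wrapped in single quotes; embedded single
     * quotes are doubled before URL encoding.
     */
    public static String build(String siteUrl, String keywords) {
        String base = siteUrl.endsWith("/") ? siteUrl.substring(0, siteUrl.length() - 1) : siteUrl;
        String quoted = "'" + keywords.replace("'", "''") + "'";
        // URLEncoder produces form encoding, so turn '+' (space) into %20.
        String encoded = URLEncoder.encode(quoted, StandardCharsets.UTF_8).replace("+", "%20");
        return base + "/_api/search/query?querytext=" + encoded;
    }

    public static void main(String[] args) {
        // Hypothetical site URL, for illustration only.
        System.out.println(build("https://sharepoint.example.com/sites/kb", "approval flow"));
    }
}
```

The resulting URL would be the target of an HTTP GET issued from a plugin or an HTTP-capable node, with whatever credentials the SharePoint site requires.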