Adobe PDF Document Content Searching; Does anyone have any experienci
brianm
over 12 years ago
Adobe PDF Document Content Searching;
Does anyone have any experience searching content within a PDF document? There has been a recent request to provide this capability, and I wanted to find out whether a smart service exists for it, or whether anyone has used a different approach to search the content of a specific PDF document within Appian.
Thank you in advance for your help!
Brian...
OriginalPostID-62859
Sathya Srinivasan
Appian Employee
over 12 years ago
You can very easily build a smart service using the Apache POI library. It has lots of custom extractors for various document types. I don't have the exact code for PDF, but I wrote extenders for Office Open XML documents (e.g. docx, pptx, xlsx), and here is the code for extracting content from a docx file. You would use the same approach for PDF; the only item you need to change is the document class (from XWPFDocument to the one corresponding to PDF).
    package com.appiancorp.search.extend.extract;

    import java.io.InputStream;

    import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
    import org.apache.poi.xwpf.usermodel.XWPFDocument;

    import com.appiancorp.search.extract.text.ITextExtractor;
    import com.appiancorp.search.extract.text.TextExtractorException;

    /**
     * @author sathya.srinivasan
     * @description - MS DOCX document extractor
     * @date 24/11/2011
     */
    public class DocxTextExtractor implements ITextExtractor {

        public DocxTextExtractor() {
        }

        @Override
        public String extract(InputStream in) throws TextExtractorException {
            try {
                // Parse the .docx stream and pull out its plain text.
                XWPFDocument document = new XWPFDocument(in);
                XWPFWordExtractor wordExtractor = new XWPFWordExtractor(document);
                return wordExtractor.getText();
            } catch (Exception e) {
                // Wrap any parsing failure in the extractor's own exception type.
                throw new TextExtractorException("Exception: " + e.getMessage());
            }
        }
    }
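Once an extractor like the one above has returned the document text as a String, the searching part of the original question can be done in plain Java. The following is only a minimal sketch under stated assumptions: the class name `ExtractedTextSearch` and its methods are hypothetical and are not part of Appian or Apache POI, and the text is assumed to have been extracted already. For PDF input specifically, POI targets Microsoft Office formats, so a separate PDF library (Apache PDFBox is one option) would probably be needed for the extraction step; that is an assumption and nothing in this thread confirms it.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: searches text that an extractor has already produced. */
final class ExtractedTextSearch {

    private ExtractedTextSearch() {
    }

    /**
     * Returns the character offsets of every non-overlapping, case-insensitive
     * occurrence of term in text. A null text, or a null or empty term,
     * yields an empty list.
     */
    static List<Integer> findOffsets(String text, String term) {
        List<Integer> offsets = new ArrayList<>();
        if (text == null || term == null || term.isEmpty()) {
            return offsets;
        }
        int i = 0;
        while (i + term.length() <= text.length()) {
            // regionMatches compares in place, so offsets refer to the original text.
            if (text.regionMatches(true, i, term, 0, term.length())) {
                offsets.add(i);
                i += term.length();
            } else {
                i++;
            }
        }
        return offsets;
    }

    /** Returns the match plus up to radius characters of context on each side. */
    static String snippet(String text, int offset, int termLength, int radius) {
        int start = Math.max(0, offset - radius);
        int end = Math.min(text.length(), offset + termLength + radius);
        return text.substring(start, end);
    }
}
```

A smart service could call `ExtractedTextSearch.findOffsets(extractedText, searchTerm)` and return either a simple found/not-found flag or the snippets around each hit. A linear scan like this is enough for a single document; searching across many documents would more likely call for an index.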