Hi, I want to extract text from a PDF page by page and paragraph by paragraph. The document contains different headings, each with its related content. I then want to store the result in a DB table with two columns: a heading column and a content column. I initially used the getpdftext function, then the extract function to pull out the desired content, but I did not get the expected result because the heading words also appear inside the content. Can anybody please help me out?
Hi,
This method is very specific to the document format.
First thing to ask: can you use IDP? It is probably easier than working with the PDF plugin.
Second, if the only problem is that the heading text shows up in the content, you can just remove the heading with substitute(local!text, localHeading, "")
Third, if the former options are not applicable for some reason, I think I will need to see a sample of the text that getpdftext returns. Is that possible?
Hi,
First of all, thank you for your response.
I think Appian recommends not using IDP for paragraph-type content.
Actually, I can't send you samples. Basically, the headings are like Article 1, 2, 3, etc., and the contents are in paragraph form. Some large articles span multiple pages and contain multiple paragraphs. Some contents also mention Article 1, 2, 3... inside them. I hope you get my point.
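To make the structure described above concrete, here is a minimal sketch in Python (not Appian expression code) of one way to separate headings from content. It rests on assumptions that are not confirmed in this thread: that each heading sits on its own line as "Article <number>", that the text is already available as one string per page, and that the names `HEADING_RE` and `split_articles` are purely illustrative. Inline mentions such as "see Article 2" stay in the content because the pattern only matches a whole line.

```python
import re

# Assumption: a heading is a line containing only "Article <number>",
# optionally followed by "." or ":". Inline mentions are not headings.
HEADING_RE = re.compile(r"^\s*(Article\s+\d+)\s*[.:]?\s*$", re.IGNORECASE)


def split_articles(pages):
    """Return a list of (heading, content) pairs from a list of page texts."""
    rows = []
    heading = None
    body = []
    # Pages are walked in order without resetting state, so an article
    # that continues on the next page keeps accumulating content.
    for page in pages:
        for line in page.splitlines():
            match = HEADING_RE.match(line)
            if match:
                # A new heading closes the previous article, if any.
                if heading is not None:
                    rows.append((heading, "\n".join(body).strip()))
                heading = match.group(1)
                body = []
            elif heading is not None:
                body.append(line)
            # Text before the first heading is ignored in this sketch.
    if heading is not None:
        rows.append((heading, "\n".join(body).strip()))
    return rows
```

Each returned pair maps to one row of the two-column table (heading, content). If the real headings share a line with other text, the pattern would need to be adjusted, and the same idea would still have to be ported to whatever Appian functions or plug-ins are available.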