Extracting text from a PDF paragraph by paragraph

Hi,
I want to extract text from a PDF page by page and paragraph by paragraph. The document contains several headings, each followed by its related content.
I then want to store the result in a database table with two columns: one for the heading and one for the content.
I first used the getpdftext function and then the extract function to pull out the desired content, but I did not get the expected result because the heading words also appear inside the content.
Can anybody please help me out?

  • Hi,

    This approach depends heavily on the document's format.

    First, can you use IDP? It is probably easier than working with the PDF plugin.

    Second, if the only problem is that the heading appears in the content, you can remove it with substitute(local!text, local!heading, "")

    Third, if neither of those is applicable for some reason, I would need to see a sample of the text that getpdftext returns. Is that possible?
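
    As a rough illustration of the heading/content split asked about in the question, here is a minimal sketch of the logic. It is written in Python rather than as an Appian expression, purely to show the idea. It assumes the list of headings is known in advance and that each heading sits on its own line in the extracted text; the function name split_by_headings and the sample text are hypothetical. A line counts as a heading only when the whole line matches, so heading words that legitimately occur inside the body are kept, which a global find-and-replace would not do.

```python
def split_by_headings(text, headings):
    """Split extracted text into (heading, content) pairs.

    Assumption: `headings` is known in advance and every heading
    appears alone on its own line in `text`.
    """
    # Map a normalised form of each heading back to its original spelling.
    known = {h.strip().lower(): h for h in headings}
    rows = []
    current_heading = None
    buffer = []
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.lower() in known:
            # A new heading starts: flush the section collected so far.
            if current_heading is not None:
                rows.append((current_heading, " ".join(buffer)))
            current_heading = known[stripped.lower()]
            buffer = []
        elif current_heading is not None and stripped:
            # Non-empty body line belonging to the current heading.
            buffer.append(stripped)
    # Flush the final section, if any.
    if current_heading is not None:
        rows.append((current_heading, " ".join(buffer)))
    return rows


# Hypothetical sample standing in for the text returned by the PDF extraction.
sample = (
    "Introduction\n"
    "This introduction covers scope.\n"
    "\n"
    "Method\n"
    "The method has two steps.\n"
)
rows = split_by_headings(sample, ["Introduction", "Method"])
for heading, content in rows:
    print(heading, "|", content)
```

    Each tuple in rows corresponds to one database row, with the first element going to the heading column and the second to the content column. If the headings are not known up front or do not sit on their own lines, this approach does not apply as written, which is why a sample of the real extracted text would help.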
