I have a list of paragraphs, and one of them should contain two keywords in separate places, in either order. For example, the text could be "Hi Carls, your property is ready, you can verify in 96 days" or the other way around: "In 96 days your new property would be ready". In this case I would like to find the two keywords and get the exact paragraph. The keywords that will always be present in this scenario are "property" and "## days". Some paragraphs may contain only one of them, but it is a match only when both appear together. I tried regexarrayindexfirstmatch but could not make it work; I also tried the like() and find() functions.
Another thing: all these paragraphs come from a PDF file of about 20 pages, and I'm getting them with the getpdftext() function.
Is there any function or approach that could help achieve this efficiently? The only way I have found is looping character by character through every single paragraph.
Thanks in advance!
Hi,
A solution (very dependent on the PDF formatting, though) is to try splitting the text into paragraphs using split(text, char(10)). If this works for your PDF, you could then loop over the paragraphs and check for both keywords in an AND condition to find the matching paragraph.
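To illustrate the idea outside Appian, here is a minimal Python sketch of the same logic: split on line breaks, then keep only the paragraphs that contain every keyword. The function name find_matching_paragraphs and the sample text are made up for the example, and it assumes each paragraph sits on its own line, which depends on how the PDF text was extracted.

```python
def find_matching_paragraphs(text, keywords):
    """Return the paragraphs that contain every keyword (case-insensitive)."""
    # Assumption: one paragraph per line; blank lines are dropped.
    paragraphs = [p.strip() for p in text.split("\n") if p.strip()]
    return [
        p for p in paragraphs
        if all(k.lower() in p.lower() for k in keywords)
    ]


sample = (
    "Hi Carls, your property is ready, you can verify in 96 days\n"
    "The property was inspected last week\n"
    "Payment is due in 96 days\n"
)

# Only the first paragraph contains both keywords.
matches = find_matching_paragraphs(sample, ["property", "96 days"])
print(matches)
```

The order of the keywords inside the paragraph does not matter here, because each keyword is checked independently.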
Yes, I follow, but in my case I'm not looking for a specific number: it could be any number from 0 to 999, followed by the word "days", and the paragraph also needs to contain the other word.
Then one of the find() calls can be replaced by a regexmatch expression that finds a number followed by a fixed string. It seems that you have access to the plugin, right?
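As a rough sketch of that combination, again in Python rather than Appian: one check is a regular expression for a 1 to 3 digit number followed by "days", the other is a plain substring check for "property". The names DAYS_PATTERN and is_match are hypothetical, and the exact pattern (word boundaries, case-insensitivity) is an assumption about what should count as a match.

```python
import re

# Assumption: "## days" means 1 to 3 digits, whitespace, then the word "days".
DAYS_PATTERN = re.compile(r"\b\d{1,3}\s+days\b", re.IGNORECASE)


def is_match(paragraph):
    """True only when the paragraph has both a day count and the word 'property'."""
    has_days = DAYS_PATTERN.search(paragraph) is not None
    has_property = "property" in paragraph.lower()
    return has_days and has_property


print(is_match("hi carls your property is ready you can verify in 96 days"))
print(is_match("in 96 days your new property would be ready"))
print(is_match("your property is ready"))
```

A paragraph with only one of the two keywords is rejected, which is the behaviour asked for in the original question.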
a!localVariables(
  /*local!test:*/
  /*"hiugjkindiaChinaFrance (8) days aklsjdf;lajsdlkfja",*/
  /*local!prueba: regexfirstmatch( "^[(]\d{1,3}[)]\s(days)$", local!test ),*/
  local!a: "sECTION 7.6 The Accountant. The Partners shall agree upon a mutually acceptable accountant to be the initial accountant and auditor for the Company and each Subsidiary Entity (the “Accountant”). The fees and expenses of the Accountant shall be a Company expense. SECTION 7.7 Company Audit. Subject to Section 5.4(c) and Section 7.4(e), Investor Partner (or its Affiliate) is hereby designated as the partnership representative of the Company, in accordance with Section 6223 of the Code and any similar provision under any state or local tax laws, and all decisions and elections of Investor Partner as such are subject to General, Partner’s prior written consent (not to be unreasonably withheld, conditioned or delayed). The partnership representative, subject in all cases to the provisions of Section 6.2(a)(xxi) and Section 6.2(a)(xxii), shall be authorized to take any actions necessary with respect to any audit, examination or investigation (including any judicial or administrative proceeding) of the Company by any U.S. federal, state or local or non-U.S. taxing authority. Each Partner shall keep the other Partners informed of the progress of any tax audits or examinations. Each Partner shall give prompt notice to each other Partner of any and all notices it receives from the Internal Revenue Service concerning the Company or any Subsidiary Entity, including any notice of audit, any notice of action with respect to a revenue agent’s report, any notice of a thirty (30) days appeal letter and any notice of a deficiency in tax concerning the Company’s and or any Subsidiary Entity’s federal income tax return. 
Each Partner shall, at the Company’s expense, furnish each other Partner with status hiugjkindiaChinaFrance (8) days aklsjdf;lajsdlkfja; a,sdjflkajs (88) days dlk akjsdl;fja",
  local!b: split(local!a, char(10) & char(10)),
  local!arrayvars: a!forEach(
    items: local!b,
    expression: if(
      and(
        regexmatch("^[(]\d{1,3}[)]\s(days)$", fv!item) = true(),
        find("hiugjkindiaChinaFrance", fv!item, 0) > 0
      ),
      fv!item,
      {}
    )
  ),
  local!arrayvars
)
I wrote this code for testing purposes, but it is not working: either the regex plugin is failing or I am using it incorrectly, because it never matches.
Sorry, I was going about it the wrong way. I have now fixed it with regexfirstmatch( ".*[(]\d{1,3}[)]\s(days).*", local!test ). But I have a question: how do you see an implementation like this in a production environment, with documents of hundreds of pages and hundreds of users uploading PDF documents to extract this kind of data?
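For anyone hitting the same issue, the likely reason the first pattern never matched is the ^ and $ anchors: they require the whole text to be exactly "(8) days", with nothing before or after. The Python sketch below shows the difference on the test string from the code above. Python's re module is not the Appian regex plugin, so treat this only as an illustration of how anchors behave in general.

```python
import re

text = "hiugjkindiaChinaFrance (8) days aklsjdf;lajsdlkfja"

# Anchored: the text must start with "(" and end right after "days".
anchored = re.compile(r"^[(]\d{1,3}[)]\s(days)$")
# Unanchored: the pattern may appear anywhere inside the text.
unanchored = re.compile(r"[(]\d{1,3}[)]\s(days)")

anchored_hit = anchored.search(text)
unanchored_hit = unanchored.search(text)

print(anchored_hit)             # None: the text does not start with "("
print(unanchored_hit.group(0))  # (8) days
```

Wrapping the pattern in .* on both sides, as in the fix above, has a similar effect to removing the anchors: the number and "days" may then appear anywhere in the paragraph.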
If the documents aren't very big (around 20 pages, as in the first post), this should be manageable. However, this approach does not scale well. The next step could be to use RPA or IDP; these solutions scale better on repetitive and large tasks.