Total Time taken for CSV to Database Import

Hello

We are using Appian 17.4 to bulk upload CSV data to a SQL Server database. We are achieving this with the ‘Import CSV to Database’ smart service.

One of the CSV files has 6 columns and around 20k records. There are no data transformations.

Below are the 2 issues:

Access Denied Error:
When we upload the CSV containing around 20k rows, the data is successfully inserted into the database. However, after the successful upload, a ‘403 Access Denied’ error is shown on the screen.

Total Upload Time:
Uploading a CSV of 5k rows takes around 5 minutes, which seems a bit high. We understand this is expected to some extent, since Appian is communicating with a different server to write the data.

Is there any way, or an alternative approach, to reduce the time taken to upload the data to the database?

 

Thanks

Nirav


  • Some coworkers have used similar CSV-to-DB loading methods, though on pretty small text files (10-200 rows) with decent speed. Those only run a few times a day.

    I have used some DB client tools (AQT, MS SQL Management Studio, etc.) to run large sets (3k+) of 'insert' statements and they are relatively slow, similar to what you are seeing. I think this is likely caused by the round-trip latency between each insert statement, combined with the database performing a commit and logging for each row. The tools show the pre-built insert statements scrolling by, and I would bet it takes about 1 minute per thousand rows.
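
    For what it's worth, a lot of that per-row cost can be squeezed out even in a plain SQL client by batching: wrap the generated inserts in one explicit transaction and/or collapse them into multi-row VALUES statements. A rough sketch against a made-up staging table:

    -- Instead of thousands of individual INSERT ... VALUES statements (one round
    -- trip and one implicit commit each), batch them:
    BEGIN TRANSACTION;

    INSERT INTO dbo.StagingTable (Col1, Col2)       -- hypothetical table/columns
    VALUES ('a1', 'b1'),
           ('a2', 'b2'),
           ('a3', 'b3');    -- a table value constructor accepts up to 1,000 rows per statement

    COMMIT TRANSACTION;     -- one commit for the whole batch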

    A few years ago a contractor wrote this process, which I admit isn't great:
    We have Appian read a set of a few hundred pipe-delimited rows of data from a DB2 database, parse the data into CDTs, and then, for each row in the CDT, spawn a separate Appian process. Those processes perform 1-2 select statements (on MS SQL Server) against a 3+ million row table (poorly indexed due to low-cardinality data), determine whether the data is 'new' or 'existing', and then run an insert or update statement against the table (with a few XOR and script nodes in the process model).

    Sadly, this performs miserably, due to a combination of spawning hundreds of processes in <3 minutes (I do it in batches of 40 or so rows/processes, wait 10-15 seconds, then spawn another 40, looping until all are done) and running a few hundred database queries and updates/inserts with 0.3-1.5 second query response times. I believe the combination of process spawning and queries causes bottlenecks on our server. We are planning to move 99% of the older data to a history table and retain only a few weeks' worth in our 'live' table. Reporting and other business logic that needs access to all 2.5+ years of data in that table will query a view that is a union of the live and history tables. Slightly longer term we will be implementing a solution based on the DataStage approach described below.
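
    For reference, that live/history split usually ends up looking something like the sketch below (table and column names are just placeholders, not our actual schema). UNION ALL avoids the duplicate-elimination cost of a plain UNION when the two tables can never share rows:

    -- Hypothetical shape of the live/history split
    -- dbo.Orders_Live : only the last few weeks of rows
    -- dbo.Orders_Hist : everything older, moved out by a scheduled job

    CREATE VIEW dbo.vw_Orders_All
    AS
    SELECT OrderId, OrderDate, StatusCode FROM dbo.Orders_Live
    UNION ALL
    SELECT OrderId, OrderDate, StatusCode FROM dbo.Orders_Hist;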

    We have recently created a few daily processes that use IBM's DataStage product (part of their InfoSphere suite) to perform bulk data transformation/loading for ETL-type jobs. Needless to say, using a product built for this type of work is FAR superior in speed, stability and repeatability to shoe-horning large ETL loads into Appian. Unfortunately, products like DataStage are not cheap, are probably only found at bigger businesses, and, given their learning curve and user base, are probably not available to your team without reaching out to another group internally. You would then need to coordinate having Appian trigger the loads in the separate tool set and receive acknowledgment when the loads complete.

    Here are a few alternatives you could consider:

    1) Python/Perl or another text-parsing-friendly language that won't incur the CPU overhead that Appian is likely to incur (parse the CSV file and then have the program you build perform the inserts). This may require a second set of application servers, since Appian prefers that we not run separate applications on the same systems as Appian. Depending on your security and server needs, this could be an expensive or time-consuming development effort if you have to spin up new servers and get your non-Appian code approved by your security teams/code reviewers.

    2) Have your database engine use its native bulk insertion tool set/application to perform the bulk inserts (learning curve, scheduling, etc.). This could be a good solution if you have DBAs and the rights to use the bulk insertion software (a minimal BULK INSERT sketch follows this list).

    3) Use Appian to read in your CSV, but do NOT parse the individual rows of the file into a CDT. Instead, pass the CSV file as-is, or with minimal parsing (remove headers/footers), to a stored procedure in your database of choice. Have the stored procedure use its native transactional SQL language to parse the CSV, cleanse and transform the data, and then perform the insertions (with or without calling native bulk insertion logic, which would be faster since it essentially turns off the logging and per-command commits).
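
    For option 2 on SQL Server, the native route is BULK INSERT (or the bcp command-line utility). A minimal sketch, with a made-up path and table, and assuming the file has been dropped somewhere the database engine itself can read:

    -- The path is resolved on the SQL Server machine (or a share it can reach), not on the client.
    BULK INSERT dbo.CsvStaging
    FROM 'D:\feeds\upload.csv'
    WITH (
        FIELDTERMINATOR = ',',
        ROWTERMINATOR   = '\n',
        FIRSTROW        = 2,        -- skip the header line
        TABLOCK                     -- can enable minimally logged inserts under simple/bulk-logged recovery
    );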

    I recommend #3 if you don't have access to an ETL tool set or don't want to incur learning additional languages and having to call them. It lets you use Appian to 'control' the process, in line with the rest of your applications and logic. Appian should be able to read the roughly 20k-row (2-3 MB?) text file easily and then pass it as a single variable to a stored procedure. I would think your database could loop over the CSV, parse it pretty quickly, and then perform its operations. You will need to write the parsing logic in your database's native language, but a DBA can do this relatively easily. Appian has an 'Execute Stored Procedure' smart service available that is easy to use if you aren't returning record sets or complex data back to Appian.

    You could have your normal Appian flow read the CSV and call the stored procedure asynchronously if you don't need immediate results (or don't want to risk a timeout in the procedure call node). You could probably set up a JMS message or web service that the database signals when it completes, or have the stored procedure immediately insert a row into a logging table with an incrementing key value and return that key to the calling process model; when the stored procedure completes, it updates the log record with a success/failure indicator. Appian then polls the log table for that particular record, repeating (I know, this would require a loop in Appian, something Appian doesn't love) until it either sees a success/failure indicator or times out. There are probably a lot of ways to set up this round trip from Appian to the DB so you know when the parsing/insertion has completed, some obviously less graceful than others. If your CSV parsing and insertion into the database takes less than 10 seconds, you can probably use the parent Appian process without risking the node timing out (I'm not sure if the node can be set for longer than the native 10 seconds; I think you can do 60). If the parsing takes longer, you may need a more asynchronous type of process.
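
    To make that concrete, here is a minimal sketch of what the database side of #3 could look like. Everything in it is hypothetical (procedure, table and column names, and a 3-column file instead of your 6), and the parsing is deliberately kept simple: a WHILE loop peels off one line at a time and an XML cast splits each line into columns, which works on SQL Server 2012 and later. A set-based splitter (or STRING_SPLIT on 2016+), or handing the work to native bulk insert, would be faster, and the XML trick assumes the data itself contains no commas, ampersands or angle brackets inside a field. Appian would pass the whole file contents as the single @csv parameter via the Execute Stored Procedure smart service.

    CREATE PROCEDURE dbo.usp_LoadCsvPayload
        @csv VARCHAR(MAX)                          -- entire CSV file contents passed from Appian
    AS
    BEGIN
        SET NOCOUNT ON;

        DECLARE @line   VARCHAR(4000),
                @break  INT,
                @rownum INT = 0,
                @x      XML;

        -- Normalise Windows line endings so we only have to look for LF
        SET @csv = REPLACE(@csv, CHAR(13), '');

        BEGIN TRANSACTION;                         -- one commit for the whole file, not one per row

        WHILE LEN(@csv) > 0
        BEGIN
            -- Peel the next line off the front of the payload
            SET @break = CHARINDEX(CHAR(10), @csv);
            IF @break = 0
            BEGIN
                SET @line = @csv;
                SET @csv  = '';
            END
            ELSE
            BEGIN
                SET @line = LEFT(@csv, @break - 1);
                SET @csv  = STUFF(@csv, 1, @break, '');
            END

            SET @rownum = @rownum + 1;

            IF @rownum > 1 AND LEN(@line) > 0      -- skip the header row and any blank lines
            BEGIN
                -- Turn 'a,b,c' into '<c>a</c><c>b</c><c>c</c>' so column N is addressable
                SET @x = CAST('<c>' + REPLACE(@line, ',', '</c><c>') + '</c>' AS XML);

                INSERT INTO dbo.CsvTarget (Col1, Col2, Col3)
                VALUES (@x.value('(/c)[1]', 'VARCHAR(200)'),
                        @x.value('(/c)[2]', 'VARCHAR(200)'),
                        @x.value('(/c)[3]', 'VARCHAR(200)'));
            END
        END

        COMMIT TRANSACTION;
    END

    Wrapping the loop in one transaction means one commit for the whole file instead of one per row, which is most of the win over issuing thousands of individual inserts from the client.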

    Either way, offloading the CSV parsing and table insertion logic from Appian to the database is going to reduce the load in Appian significantly (Appian presumably is NOT tuned for large batch processing) and will probably be a lot faster. Heck, I might try a proof of concept to replace my parsing/process-spawning heavy-load process above with a stored proc.
  • Thanks ryanh for the detailed information.
    Yeah, that seems to be the best approach. In that case, how can we pass the CSV to the stored procedure? Do we store it in the KC? If so, how can the DB get access to the file in the KC?

    Thanks again
  • I played around with this right after I posted that long reply above (since I too haven't yet solved the issue).

    In my case, my source data is coming from a DB2 table, and I am parsing the pipe-delimited data and then using the parsed data to run some select queries, if/then logic, and insert/update statements into MS SQL. Since I don't really start with a CSV file, and I didn't want to worry about security (loading a CSV onto the database system or using the native 'bulk insert' capability), I chose to query the DB2 data using a query entity, force the data into one big delimited PV variable, and change some of the delimiters that Appian adds (brackets, semicolons with a space afterwards).

    tostring(pv!db2datasubset.data.ReportData_Txt)
    substitute(pv!db2delimited,"; ", "@@@")

    I have a @@@ delimited array of sorts, with nested pipe delimited strings saved in my pv.

    I haven't yet passed that from Appian to a stored proc, but I played with making a function that would let me quickly parse a pipe-delimited string. My hope was that I could create a stored procedure that loops over a single varchar(max) variable, splits the single large string into essentially an array of smaller rows of data, and then, in a loop, performs the select/if/then/insert logic we need on each parsed row.

    There is a new STRING_SPLIT function built into MS SQL 2016. Unfortunately, we are still on 2012 (and running under the 2010 compatibility level), and there isn't a native splitting function that lets you grab element X from a delimited list (let alone specify the delimiter). I got slowed down looking online for fast, low-overhead splitting-function examples on Stack Exchange. There are some really interesting solutions, though some require CLR and installed executables rather than native SQL logic (or are slow).
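
    For anyone on 2016 or later (with database compatibility level 130+), the built-in function handles the basic split, though the delimiter must be a single character and it doesn't expose the element position, so it still doesn't solve the "grab element X" problem on its own:

    -- SQL Server 2016+: split one pipe-delimited row into individual values
    DECLARE @row VARCHAR(MAX) = 'A123|2017-10-31|42|Widget';

    SELECT value
    FROM   STRING_SPLIT(@row, '|');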

    Back to your question:

    Pull the CSV file into a PV variable of type Text.
    I haven't yet called the SQL function (or made a stored procedure) directly from Appian, but I assumed that I could pass the string in as a varchar(max) parameter. The (max) isn't limited to the traditional 8,000-byte length that a plain varchar() is limited to.

    CREATE FUNCTION fn_Wordparser
    (
        @multiwordstring VARCHAR(MAX),
        @wordnumber      INT,
        @delimiter       VARCHAR(5)
    )
    RETURNS VARCHAR(500)
    AS