Best way to migrate a large volume of documents (70-80 GB)

Hi All,

Looking for the best way to move a large volume of documents from an external system into Appian. We are replacing an existing mainframe system, and as part of that we are doing a document and data migration. The documents are broken out into a logical folder structure, and I tried the zip upload feature, but Appian only allows folders under 1 GB. I tried breaking them out into smaller folders, but even at 600 MB per folder the system was still getting choked up. We are on Appian Cloud. Any ideas/approaches for executing a large document migration?


  • Certified Lead Developer

    I'd carefully consider whether you should use Appian for this bulk storage - Appian can do a lot of things well, but I don't believe a bulk legacy document storage solution is really on-target for its intended use cases. Will every one of the 80 GB of legacy documents be used actively within the Appian application, for example? If not, you should probably consider an external bulk file storage solution and perhaps interface it with your Appian system for flexibility as needed.

  • Certified Lead Developer

    Using an integration and migrating the documents with Appian processes will be more efficient than manually loading them. For example, you can build a process in Appian that connects to an SFTP server holding the documents. The process can parse the documents/folders on the SFTP server and then store them in the appropriate KB/folder structure on the Appian side based on your business requirements.

    Automating the process will require you to monitor it and potentially batch the work so you don't overload system resources. I would suggest setting up some test cases in lower environments to verify that your migration process functions and performs well enough to run in production. If you have active users in your system, you may also want to pause the migration processes during core business hours and let them run off-hours until you have worked through all the migration documents. A rough sketch of this pattern is below.
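
    To make that concrete, here's a minimal sketch of a staging script that walks the SFTP server and pushes documents in paced batches. Everything here is an assumption for illustration: the host, the `doc-migration` endpoint (a custom Appian Web API you would build yourself), the batch sizes, and the off-hours window.

    ```python
    # Sketch only -- all names here (host, endpoint, paths, credentials) are
    # placeholders. Assumes key-based SFTP auth and a custom Appian Web API
    # (which you would build) that accepts a document body and returns 200.
    import datetime
    import stat
    import time

    import paramiko
    import requests

    SFTP_HOST = "sftp.example.com"
    SOURCE_ROOT = "/export/mainframe-docs"
    APPIAN_WEBAPI = "https://yoursite.appiancloud.com/suite/webapi/doc-migration"
    API_KEY = "..."                # API key for a dedicated service account
    BATCH_SIZE = 50                # tune in a lower environment first
    BATCH_PAUSE_SECONDS = 60       # breathing room between batches


    def off_hours() -> bool:
        """Only migrate outside core business hours (here: before 7am, after 7pm)."""
        hour = datetime.datetime.now().hour
        return hour < 7 or hour >= 19


    def walk(sftp: paramiko.SFTPClient, root: str):
        """Yield every file path under root, depth-first."""
        for entry in sftp.listdir_attr(root):
            path = f"{root}/{entry.filename}"
            if stat.S_ISDIR(entry.st_mode):
                yield from walk(sftp, path)
            else:
                yield path


    transport = paramiko.Transport((SFTP_HOST, 22))
    transport.connect(username="migration",
                      pkey=paramiko.RSAKey.from_private_key_file("id_rsa"))
    sftp = paramiko.SFTPClient.from_transport(transport)

    batch = []
    for path in walk(sftp, SOURCE_ROOT):
        batch.append(path)
        if len(batch) < BATCH_SIZE:
            continue
        while not off_hours():
            time.sleep(600)        # wait out core business hours
        for doc_path in batch:
            with sftp.open(doc_path, "rb") as fh:
                # sourcePath lets the Appian side map the file to a KB/folder
                requests.post(APPIAN_WEBAPI,
                              headers={"Appian-API-Key": API_KEY},
                              params={"sourcePath": doc_path},
                              data=fh.read(),
                              timeout=120).raise_for_status()
        batch.clear()
        time.sleep(BATCH_PAUSE_SECONDS)
    # (a real script would also flush the final partial batch and log failures)

    sftp.close()
    transport.close()
    ```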

    There is also a playbook article on the topic that may give you some additional information on what to look out for when setting up an automated process.

    https://community.appian.com/w/the-appian-playbook/105/bulk-legacy-document-migration-into-appian

  • We recently moved documents (about 17 GB) from our old legacy system into our production Appian environment using the same technique mentioned above. You can create a temp table in your environment and write a record for each document into it while transferring them via SFTP. Once you have all the document info in the temp table, you can query it (you decide the batch size) and feed the results to the actual utility (the process model that does the actual document migration). A sketch of this is below. I hope this helps.
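
    To illustrate, here is roughly what the batching side can look like. The table and column names (`doc_migration`, `migrated`, etc.) are made up for this sketch; inside Appian, the utility process model would read the same table in batches with a!queryEntity.

    ```python
    # Sketch only -- hypothetical staging table (doc_migration) and columns.
    # Each row was written during the SFTP transfer; the utility process reads
    # one batch at a time and flags rows as done so a failed run can resume.
    import pymysql

    conn = pymysql.connect(host="db.example.com", user="migration",
                           password="...", database="appian_business")
    BATCH_SIZE = 100

    with conn.cursor() as cur:
        # Next unmigrated batch; an Appian process would do the equivalent with
        # a!queryEntity (and a!pagingInfo for the batch size) against this table.
        cur.execute(
            """SELECT id, source_path, target_folder
                 FROM doc_migration
                WHERE migrated = 0
                ORDER BY id
                LIMIT %s""",
            (BATCH_SIZE,),
        )
        batch = cur.fetchall()

    # ...hand `batch` to the migration utility, then mark the rows complete:
    with conn.cursor() as cur:
        cur.executemany("UPDATE doc_migration SET migrated = 1 WHERE id = %s",
                        [(row[0],) for row in batch])
    conn.commit()
    conn.close()
    ```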

  • Certified Lead Developer

    Appian can build a great many things, but a data warehouse or document repository is not necessarily among them. You can have between 3 and 32 shards, or process execution / analytics engine pairs, but even with 32 pairs of processing engines, you still only get one content engine per environment. Ours is bogged down and feeling the strain under several million objects, most of them empty folders. We're seeing numerous incidents of nodes that query the content engine timing out and jamming processes. Size may not play as significant a role as quantity. Think carefully about how many documents the system is meant to support before it's sunsetted.

    If your users are going to be querying / using these documents on a regular basis, you may want to consider off-site storage, pulling only the actively used documents into your Appian memory / storage, the same way we're contemplating it.

  • It seems like there's some related discussion going on in this thread about scaling the document management facet of Appian. I'd recommend taking a look at the official responses in the threads here https://community.appian.com/discussions/f/data/12291/number-of-documents-effect-the-performance-in-system and here https://community.appian.com/discussions/f/data/14324/how-many-documents-can-we-store-in-appian . Both of those threads talk more in terms of number of documents than size, but I'll add that there are production environments using Appian to manage upwards of 1 TB worth of documents.

    I think the advice ericg329 gave is an excellent place to start in terms of taking advantage of Appian's content management capabilities. 

  • Certified Lead Developer
    in reply to Eliot Gerson

    Now I will say that the majority of the time, the one little content engine we have still runs like the dickens, even with several million documents and folders. It seems that multiple concurrent, time-consuming queries on the content engine can eventually cause slowdown and node stoppage. If your design limits how often multiple users look for a document at the same time, or how often folders get moved, renamed, created, deleted, added to or removed from knowledge centers, or how often the security on those objects changes, you'll probably feel less heat from the content engine.

    To that end, I would take the time to migrate your documents one at a time. I would avoid concurrency as much as possible, because tiny hiccups and delays can compound over that many documents, and as the slowdown increases, the likelihood of a node timing out and grinding your process to a halt goes up. Slow and steady wins the race; a sketch of what that looks like is below.
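
    As a sketch of that "one at a time" approach (`upload_document` here is a stand-in for whatever integration actually pushes a file into Appian, not a real API):

    ```python
    # Sketch only: strictly sequential migration with exponential backoff, so a
    # transient content-engine hiccup doesn't snowball into a jammed process.
    import time


    def upload_document(path: str) -> None:
        """Stand-in for your actual integration call into Appian."""
        raise NotImplementedError


    def migrate_sequentially(paths, max_retries=5):
        failures = []
        for path in paths:
            delay = 1.0
            for _ in range(max_retries):
                try:
                    upload_document(path)
                    break
                except Exception:
                    # Back off instead of hammering the content engine again.
                    time.sleep(delay)
                    delay *= 2
            else:
                failures.append(path)   # log it for a later retry pass
            time.sleep(0.25)            # small gap between documents
        return failures
    ```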

  • If you're running into performance issues, I would encourage you to create a ticket with our support team. They will be able to give suggestions based on the symptoms you're seeing. For example, they may recommend adding additional replicas of your content engine to improve throughput if you're seeing bottlenecks.