Selecting a random sample from a large table

Certified Senior Developer

Hello, 

I'm facing the following task: I need to generate a random sample of unique rows from a table/view. A couple of notes:

  1. This is a large table, about 25 million rows
  2. This table will need to be filtered first, then sampled
    1. To filter, this table will need to join on other tables, which is why I'd like to select the random sample from a view instead of the table
    2. The random sample would need to pull from a relatively uniform/normal distribution (basically not skewed)
    3. This random sample will need to pull from live data - there cannot be a 1 day delay or anything
  3. This table is a CDT in Appian, due to point 1
  4. DB system is Oracle
  5. I don't just need a list of random numbers, but I need a sample of random PK IDs

These are the things I've tried/investigated:

  1. SAMPLE()
    1. I'm not able to get it to sample a view - it throws some sort of primary key error.
    2. People have also reported that SAMPLE pulls from a skewed distribution of data, and I'm hoping for something more uniform (sketched after this list)
  2. dbms_random.value
    1. I haven't actually tried it, since people have reported that it takes ~3 minutes to run even on a table with only 1k rows, and I don't want to bring down the environment (also sketched after this list)
  3. Appian rand() function
    1. The recursive rule (rule!KS_recursion_rand_dev_selector):

      a!localVariables(
        /* one random integer in [ri!min, ri!max] */
        local!random: ri!min + tointeger(rand() * (ri!max - ri!min)),
        local!array: append(ri!array, local!random),
        if(
          /* union() of a list with itself deduplicates it; stop once we
             have enough unique values, otherwise recurse for more */
          length(union(local!array, local!array)) >= ri!count,
          union(local!array, local!array),
          rule!KS_recursion_rand_dev_selector(
            min: ri!min,
            max: ri!max,
            count: ri!count,
            array: union(local!array, local!array)
          )
        )
      )
    2. I saw someone post this on Community, which is nice, but it's only good for returning a list of random numbers - I need to make sure it's a list of random primary keys. For example, after filtering, I may have the following list of primary keys - {1321, 34212, 8832, 9012, 12} - but imagine it's a list of 90k numbers. The table may continue to grow, so that list may also grow.
    3. I am able to query that view for just the primary keys, use the code above to generate about 50 random indexes, then index into that long, long list of primary keys, but that takes about 30 seconds to run (mostly because of the query). If the table continues to grow, this method of sampling won't be sustainable.
      1. I need to query some data so that I can guarantee I'm picking actual primary keys
  4. Adding ROW_NUMBER() to the view so I can directly use the indexes randomly generated from the code in 3
    1. I can use the code to generate a list of random indexes if I set min = 1 and max = the total count of rows in the view
    2. However, since there's no 1:1 mapping of index to view primary key, I can't use the random indexes to query into the view
    3. I can add ROW_NUMBER() as an additional column to the view to give it indexes I can use
      1. However, the view then takes about 100 seconds to run, and since I need to sample from live data, performance will be a concern every time I try to sample
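
For reference, here are the two DB-side forms I've been looking at, written out as a rough sketch (my_filtered_view, my_base_table, my_other_table, and pk_id are stand-ins for my real objects):

  -- Uniform but slow: shuffles the entire filtered set before keeping 50 rows.
  -- (FETCH FIRST needs Oracle 12c+; on 11g, wrap the ordered query and filter ROWNUM <= 50.)
  SELECT pk_id
  FROM   my_filtered_view
  ORDER  BY DBMS_RANDOM.VALUE
  FETCH  FIRST 50 ROWS ONLY;

  -- SAMPLE only works on base tables, not views, so the filter joins have to be
  -- repeated here. Plain SAMPLE keeps each row with the given probability;
  -- SAMPLE BLOCK is the faster, block-based variant usually blamed for skew.
  SELECT t.pk_id
  FROM   my_base_table SAMPLE (0.1) t
  JOIN   my_other_table o ON o.fk_id = t.pk_id;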

Any suggestions? Thank you!

  • 0
    Certified Lead Developer

    Why not use the TotalCount to determine the table's current size, then generate a random value between 1 and N, and query that using the generated value as the StartIndex (with a page size of 1)?  You could do this for one or several queries into a random row.

    a!localVariables(
      /* fetch just the total row count (batchSize 0 returns no rows) */
      local!totalEntries: rule!ASDF_QRY_PersonDocuments(
        pagingInfo: a!pagingInfo(startIndex: 1, batchSize: 0),
        fetchTotalCount: true()
      ).totalCount,

      local!totalSamples: 5,

      /* random start indexes in [1, totalEntries]; the offset guards
         against an index of 0 after rounding, which isn't valid */
      local!sampleIndices: a!forEach(
        enumerate(local!totalSamples),
        1 + tointeger(rand() * (local!totalEntries - 1))
      ),

      /* one single-row query per random index */
      local!sampleQueries: a!forEach(
        local!sampleIndices,
        index(
          rule!ASDF_QRY_PersonDocuments(
            pagingInfo: a!pagingInfo(
              startIndex: fv!item,
              batchSize: 1
            )
          ).data,
          1
        )
      ),

      local!sampleQueries
    )

    This takes about 1 second to execute for a table with just shy of 700,000 entries.  I imagine (depending on the number of samples desired) that it would scale up to your 25M scope decently(?), though I'd have a hard time testing that for you, as this is my prod system's largest table.

  • 0
    Certified Senior Developer
    in reply to Mike Schmitt

    I tried that, but if the start index is big enough, Appian times out with an error.

    I'll need to query from a view since I can make sure the view has the correct fields that I need to filter on.

  • 0
    Certified Lead Developer
    in reply to kl0001

    Youch.  Then I must defer to prior suggestions to run it through an SP.

  • 0
    Certified Senior Developer
    in reply to Mike Schmitt

    What are you thinking of in terms of running it through an SP? 

  • 0
    Certified Lead Developer
    in reply to kl0001

    The SP can recreate much of the view logic you're thinking of internally and do more efficient / more advanced calculations in advance (I assume up to and including using some DB-internal logic to select some samples) - since it's all done internal to the DB, it'll process faster than trying to query through Appian.
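
    Very roughly, something like this (hedging that Oracle isn't my daily driver - the procedure, view, and column names below are all placeholders, and the ORDER BY DBMS_RANDOM.VALUE still sorts the whole filtered set, so it'd need testing at your volume):

    CREATE OR REPLACE PROCEDURE get_random_pks (
      p_sample_size IN  NUMBER,
      p_pks         OUT SYS_REFCURSOR
    ) AS
    BEGIN
      OPEN p_pks FOR
        SELECT pk_id
        FROM   my_filtered_view         -- or inline the join/filter logic here
        ORDER  BY DBMS_RANDOM.VALUE     -- the shuffle happens entirely in the DB
        FETCH  FIRST p_sample_size ROWS ONLY;
    END get_random_pks;
    /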

  • 0
    Certified Senior Developer
    in reply to Mike Schmitt

    Do you have examples of random functions that can be done on the DB side? The only ones I've been able to find are DBMS_RANDOM.VALUE and SAMPLE() for Oracle, but I've heard DBMS_RANDOM.VALUE can be very slow on large tables, and SAMPLE() tends to return skewed data.

  • 0
    Certified Lead Developer
    in reply to kl0001

    True random values often appear "skewed" due to observer bias.  If you want something that's evenly distributed rather than truly random, you'll probably need to determine your own algorithm for that (i.e. take indexes from 1 to N divided by the number of samples desired, then jitter each sample index by some generated random deviation).  The DB software has loads of functions available and I don't even begin to claim to fathom them all.  If you were working in MariaDB I would be more confident in sharing some of my stored proc tricks (with the disclaimer that I haven't played much with random numbers in it), but you could probably discover the ones you need by searching the Oracle DB docs.
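
    To put that even-spacing idea in Oracle terms (same disclaimer - my_filtered_view, pk_id, and the bucket count of 50 are placeholders): deal the filtered rows into as many buckets as you want samples, then grab one random row per bucket:

    SELECT pk_id
    FROM (
      SELECT pk_id,
             ROW_NUMBER() OVER (PARTITION BY bucket
                                ORDER BY DBMS_RANDOM.VALUE) AS rn
      FROM (
        -- NTILE(50) splits the ordered rows into 50 near-equal buckets
        SELECT pk_id,
               NTILE(50) OVER (ORDER BY pk_id) AS bucket
        FROM   my_filtered_view
      )
    )
    WHERE  rn = 1;

    That returns 50 PKs spread evenly across the key range, each jittered randomly within its own bucket.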
