Selecting a random sample from a large table

Certified Senior Developer

Hello, 

I'm facing the following task: I need to generate a random sample of unique rows from a table/view. A couple of notes:

  1. This is a large table, about 25 million rows
  2. This table will need to be filtered first, then sampled
    1. To filter, this table will need to join on other tables, which is why I'd like to select the random sample from a view instead of the table
    2. The random sample would need to pull from a relatively uniform/normal distribution (basically not skewed)
    3. This random sample will need to pull from live data - there cannot be a 1 day delay or anything
  3. This table is a CDT in Appian, due to point 1
  4. DB system is Oracle
  5. I don't just need a list of random numbers, but I need a sample of random PK IDs

These are the things I've tried/investigated:

  1. SAMPLE()
    1. I'm not able to get it to sample a view - it throws some sort of primary key error.
    2. People have also reported that SAMPLE pulls from a skewed distribution of data, and I'm hoping for something more uniform (sketched after this list)
  2. dbms_random.value
    1. I haven't actually tried it, since people have reported that it takes ~3 minutes to run even on a table with only 1k rows, and I don't want to bring down the environment (also sketched after this list)
  3. Appian rand() function
    1. The recursive rule (rule!KS_recursion_rand_dev_selector):

      a!localVariables(
        /* one random integer in [ri!min, ri!max] */
        local!random: ri!min + tointeger(rand() * (ri!max - ri!min)),
        local!array: append(ri!array, local!random),
        if(
          /* union() of a list with itself deduplicates it; stop once we
             have enough unique values, otherwise recurse for more */
          length(union(local!array, local!array)) >= ri!count,
          union(local!array, local!array),
          rule!KS_recursion_rand_dev_selector(
            min: ri!min,
            max: ri!max,
            count: ri!count,
            array: union(local!array, local!array)
          )
        )
      )
    2. I saw someone post this on Community, which is nice, but it's only good for returning a list of random numbers - I need to make sure it's a list of random primary keys. For example, after filtering, I may have the following list of primary keys - {1321, 34212, 8832, 9012, 12} - but imagine it's a list of 90k numbers. The table may continue to grow, so that list may also grow.
    3. I am able to query that view for just the primary keys, use the code above to generate about 50 random indexes, then index into that long, long list of primary keys, but that takes about 30 seconds to run (mostly because of the query). If the table continues to grow, this method of sampling won't be sustainable.
      1. I need to query some data so that I can guarantee I'm picking actual primary keys
  4. Adding ROW_NUMBER() to the view so I can directly use the indexes randomly generated from the code in 3
    1. I can use the code to generate a list of random indexes if I set min = 1 and max = the total count of rows in the view
    2. However, since there's no 1:1 mapping of index to view primary key, I can't use the random indexes to query into the view
    3. I can add ROW_NUMBER() as an additional column to the view to give it indexes I can use
      1. However, the view then takes about 100 seconds to run, and since I need to sample from live data, performance will be a concern every time I try to sample
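
For reference, here are the two DB-side forms I've been looking at, written out as a rough sketch (my_filtered_view, my_base_table, my_other_table, and pk_id are stand-ins for my real objects):

  -- Uniform but slow: shuffles the entire filtered set before keeping 50 rows.
  -- (FETCH FIRST needs Oracle 12c+; on 11g, wrap the ordered query and filter ROWNUM <= 50.)
  SELECT pk_id
  FROM   my_filtered_view
  ORDER  BY DBMS_RANDOM.VALUE
  FETCH  FIRST 50 ROWS ONLY;

  -- SAMPLE only works on base tables, not views, so the filter joins have to be
  -- repeated here. Plain SAMPLE keeps each row with the given probability;
  -- SAMPLE BLOCK is the faster, block-based variant usually blamed for skew.
  SELECT t.pk_id
  FROM   my_base_table SAMPLE (0.1) t
  JOIN   my_other_table o ON o.fk_id = t.pk_id;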

Any suggestions? Thank you!

  • 0
    Certified Lead Developer

    Why not use the TotalCount to determine the table's current size, then generate a random value between 1 and N, and query that using the generated value as the StartIndex (with a page size of 1)?  You could do this for one or several queries into a random row.

    a!localVariables(
      /* fetch just the total row count (batchSize 0 returns no rows) */
      local!totalEntries: rule!ASDF_QRY_PersonDocuments(
        pagingInfo: a!pagingInfo(startIndex: 1, batchSize: 0),
        fetchTotalCount: true()
      ).totalCount,

      local!totalSamples: 5,

      /* random start indexes in [1, totalEntries]; the offset guards
         against an index of 0 after rounding, which isn't valid */
      local!sampleIndices: a!forEach(
        enumerate(local!totalSamples),
        1 + tointeger(rand() * (local!totalEntries - 1))
      ),

      /* one single-row query per random index */
      local!sampleQueries: a!forEach(
        local!sampleIndices,
        index(
          rule!ASDF_QRY_PersonDocuments(
            pagingInfo: a!pagingInfo(
              startIndex: fv!item,
              batchSize: 1
            )
          ).data,
          1
        )
      ),

      local!sampleQueries
    )

    This takes about 1 second to execute for a table with just shy of 700,000 entries.  I imagine (depending on the number of samples desired) that it would scale up to your 25M scope decently(?), though I'd have a hard time testing that for you, as this is my prod system's largest table.

  • 0
    Certified Senior Developer
    in reply to Mike Schmitt

    I tried that, but if the start index is big enough, Appian times out with an error.

    I'll need to query from a view since I can make sure the view has the correct fields that I need to filter on.

  • 0
    Certified Lead Developer
    in reply to kl0001

    Youch.  Then I must defer to prior suggestions to run it through an SP.

  • 0
    Certified Senior Developer
    in reply to Mike Schmitt

    What are you thinking of in terms of running it through an SP? 

  • 0
    Certified Lead Developer
    in reply to kl0001

    The SP can recreate much of the view logic you're thinking of internally and do more efficient / more advanced calculations in advance (I assume up to and including using some DB-internal logic to select some samples) - since it's all done internal to the DB, it'll process faster than trying to query through Appian.
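
    Very roughly, something like this (hedging that Oracle isn't my daily driver - the procedure, view, and column names below are all placeholders, and the ORDER BY DBMS_RANDOM.VALUE still sorts the whole filtered set, so it'd need testing at your volume):

    CREATE OR REPLACE PROCEDURE get_random_pks (
      p_sample_size IN  NUMBER,
      p_pks         OUT SYS_REFCURSOR
    ) AS
    BEGIN
      OPEN p_pks FOR
        SELECT pk_id
        FROM   my_filtered_view         -- or inline the join/filter logic here
        ORDER  BY DBMS_RANDOM.VALUE     -- the shuffle happens entirely in the DB
        FETCH  FIRST p_sample_size ROWS ONLY;
    END get_random_pks;
    /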

  • 0
    Certified Senior Developer
    in reply to Mike Schmitt

    Do you have examples of random functions that can be done on the DB side? The only ones I've been able to find are DBMS_RANDOM.VALUE and SAMPLE() for Oracle, but I've heard DBMS_RANDOM.VALUE can be very slow on large tables, and SAMPLE() tends to return skewed data.

  • 0
    Certified Lead Developer
    in reply to kl0001

    True random values often appear "skewed" due to observer bias.  If you want something that's evenly distributed rather than truly random, you'll probably need to determine your own algorithm for that (i.e. take indexes from 1 to N divided by the number of samples desired, then jitter each sample index by some generated random deviation).  The DB software has loads of functions available and I don't even begin to claim to fathom them all.  If you were working in MariaDB I would be more confident in sharing some of my stored proc tricks (with the disclaimer that I haven't played much with random numbers in it), but you could probably discover the ones you need by searching the Oracle DB docs.
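
    To put that even-spacing idea in Oracle terms (same disclaimer - my_filtered_view, pk_id, and the bucket count of 50 are placeholders): deal the filtered rows into as many buckets as you want samples, then grab one random row per bucket:

    SELECT pk_id
    FROM (
      SELECT pk_id,
             ROW_NUMBER() OVER (PARTITION BY bucket
                                ORDER BY DBMS_RANDOM.VALUE) AS rn
      FROM (
        -- NTILE(50) splits the ordered rows into 50 near-equal buckets
        SELECT pk_id,
               NTILE(50) OVER (ORDER BY pk_id) AS bucket
        FROM   my_filtered_view
      )
    )
    WHERE  rn = 1;

    That returns 50 PKs spread evenly across the key range, each jittered randomly within its own bucket.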
