Selecting a random sample from a large table

Certified Senior Developer

Hello, 

I'm facing the following task: I need to generate a unique random sample (no duplicates) from a table/view. A couple of notes:

  1. This is a large table, about 25 million rows
  2. This table will need to be filtered first, then sampled
    1. To filter, this table will need to join on other tables, which is why I'd like to select the random sample from a view instead of the table
    2. The random sample would need to pull from a relatively uniform distribution (basically not skewed toward any part of the table)
    3. This random sample will need to pull from live data - there cannot be a one-day delay or anything like that
  3. This table is a CDT in Appian, due to point 1
  4. DB system is Oracle
  5. I don't just need a list of random numbers; I need a sample of random primary key (PK) IDs

These are the things I've tried/investigated:

  1. SAMPLE()
    1. I'm not able to get it to sample a view; it throws some sort of primary key error.
    2. People have also reported that SAMPLE pulls from a skewed distribution of the data, and I'm hoping for something more uniform.
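    3. For reference, this is roughly the kind of query I mean (my_table and my_filtered_view are placeholder names, not my real objects):

      -- Row sampling of roughly 0.01% of the base table works:
      SELECT id
      FROM   my_table SAMPLE (0.01);

      -- The same clause against the joined/filtered view is where I hit the
      -- primary key error. (People also report that block sampling, SAMPLE
      -- BLOCK, can be skewed by how rows are physically clustered.)
      SELECT id
      FROM   my_filtered_view SAMPLE (0.01);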
  2. dbms_random.value
    1. I haven't actually tried it, since people have reported that it takes around 3 minutes to run even on a table with only 1k rows, and I don't want to bring down the environment (see the sketch below).
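    2. For context, this is the pattern I've seen suggested (my_filtered_view and the sample size of 50 are placeholders); as I understand it, the ORDER BY is what makes it slow, because it forces a sort of every filtered row:

      -- Shuffle the whole filtered set, then keep the first 50 rows;
      -- the full sort is the expensive part.
      SELECT id
      FROM (
        SELECT id
        FROM   my_filtered_view
        ORDER  BY dbms_random.value
      )
      WHERE ROWNUM <= 50;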
  3. Appian rand() function
    1.
      a!localVariables(
        /* Pick one random integer between ri!min and ri!max */
        local!random: ri!min + tointeger(rand() * (ri!max - ri!min)),
        /* Append it to the list built up so far */
        local!array: append(ri!array, local!random),
        if(
          /* union() of a list with itself removes duplicates */
          length(union(local!array, local!array)) >= ri!count,
          union(local!array, local!array),
          /* Not enough distinct values yet, so recurse */
          rule!KS_recursion_rand_dev_selector(
            min: ri!min,
            max: ri!max,
            count: ri!count,
            array: union(local!array, local!array)
          )
        )
      )
    2. I saw someone post this on community, which is nice, but it's only good for returning a list of random numbers. I need to make sure it's a list of random primary keys. For example, after filtering, I may have the following list of primary keys - {1321, 34212, 8832, 9012, 12} - but imagine it's a list of 90k numbers. The table may continue to grow, so that list may also grow.
    3. I am able to query that view for just the primary key, use the code above to generate about 50 random indexes, then index into the long, long list of primary keys, but that takes about 30 seconds to run (mostly because of the query). If the table continues to grow, this method of sampling won't be sustainable.
      1. I have to query at least the primary key column so I can guarantee the values I pick are actual primary keys
  4. Adding ROW_NUMBER() to the view so I can directly use the indexes randomly generated from the code in 3 (see the sketch after this list)
    1. I can use the code to generate a list of random indexes, if I set min = 1 and max = the total count of rows in the view
    2. However, since the index isn't a 1:1 match to the view's primary key, I can't use the random indexes to query into the view
    3. I can add ROW_NUMBER() as an additional column to the view so the view exposes indexes I can use
      1. However, then the view takes about 100 seconds to run, and since I need to sample from live data, performance will be a concern every time I try to sample
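
For clarity, this is roughly what I mean by the ROW_NUMBER() version of the view (all object names are placeholders, and the values in the IN list are just example indexes from the Appian rule above):

  -- Number every filtered row 1..N so a random index maps 1:1 to a primary key.
  CREATE OR REPLACE VIEW my_filtered_view_numbered AS
  SELECT ROW_NUMBER() OVER (ORDER BY v.id) AS rn,
         v.id
  FROM   my_filtered_view v;

  -- Then pull the primary keys for a handful of randomly generated indexes.
  SELECT id
  FROM   my_filtered_view_numbered
  WHERE  rn IN (12, 4071, 88213);

The downside, as far as I can tell, is that Oracle has to number the entire filtered result set on every one of these queries, which is where the ~100 seconds goes.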

Any suggestions? Thank you!


  • Why not use the totalCount to determine the table's current size, then generate a random value between 1 and N, and query using the generated value as the startIndex (with a batch size of 1)?  You could do this once, or several times, to pull one or several random rows.

    a!localVariables(
      /* Get the current row count without fetching any data */
      local!totalEntries: rule!ASDF_QRY_PersonDocuments(
        pagingInfo: a!pagingInfo(1, 0),
        fetchTotalCount: true()
      ).totalCount,

      local!totalSamples: 5,

      /* Generate one random start index per sample, kept within 1..totalEntries */
      local!sampleIndices: a!forEach(
        enumerate(local!totalSamples),
        1 + tointeger(rand() * (local!totalEntries - 1))
      ),

      /* Query a single row at each random start index */
      local!sampleQueries: a!forEach(
        local!sampleIndices,
        index(
          rule!ASDF_QRY_PersonDocuments(
            pagingInfo: a!pagingInfo(
              startIndex: fv!item,
              batchSize: 1
            )
          ).data,
          1
        )
      ),

      local!sampleQueries
    )

    This takes about 1 second to execute for a table with just shy of 700,000 entries.  I imagine (depending on the number of samples desired) that it would scale up to your 25M scope decently(?), though I'd have a hard time testing that for you, as this is my prod system's largest table.

  • Certified Senior Developer
    in reply to Mike Schmitt

    I tried that, but if the start index is big enough, Appian times out with an error.

    I'll need to query from a view, since that's how I can make sure I have the correct fields to filter on.
