This discussion is archived
1 2 Previous Next 15 Replies Latest reply: Aug 26, 2008 5:58 AM by 807589 RSS

trouble generating random numbers with a certain distribution

807589 Newbie
Currently Being Moderated
I'd like to generate a random integer (int or long) from 1 to N where, over time, I get the number 'n' 1/n times as often as the number 1. In other words, I get a 2 half as many times as a 1, a 3 one-third as many times as a 1, etc...does anyone know of a fast and accurate way to do this?

So far, I have two ways to do this, one that's efficient and accurate and one that's fast but too inaccurate for my application. They are as follows:

1. ...the inefficient but accurate way that uses [rejection sampling|http://en.wikipedia.org/wiki/Rejection_sampling] with the probability distribution function N/x:
java.util.Random rand = new java.util.Random();
int N = 100; // max int to generate
boolean bFound = false;
double dRand;
while (!bFound)
    {
    int iRandX = (int) ((rand.nextDouble() * (double) N) + 1.0); // random X
    double dRandY = (rand.nextDouble() * (double) N); // random Y

    // only take it (the x value) if y falls under the curve N / x
    if (dRandY <= ((double) N / (double) iRandX))
        {
        dRand = iRandX;
        bFound = true;
        }
    }
However, it's dog slow as many samples get rejected.

2. ...the fast but inaccurate way that I made after recognizing that the inverse of the cumulative probability distribution (forgive me if my math terms are wrong, which is likely) looks like the [harmonic series|http://en.wikipedia.org/wiki/Harmonic_series_(mathematics)], or 1/1 + 1/2 + 1/3 + ... for which there is an estimation using the [Euler-Mascheroni constant|http://en.wikipedia.org/wiki/Euler-Mascheroni_constant]:
java.util.Random rand = new java.util.Random();
double dEulerMascheroni = 0.57721566490153286060651209008240243104215933593992;
int N = 100;
double dMax = Math.log(N) + dEulerMascheroni;
    
double dNormalRand = rand.nextDouble() * (dMax - 1.0) + 1.0; // 1 to N
double dRand = Math.exp(dNormalRand - dEulerMascheroni); // integral
However, this is only an estimation, so after viewing a histogram, the frequencies of the smaller values in particular (1-10 or so) are wildly inaccurate, and for my application, accuracy in the lower values is critical.

The best thing I can come up with for now is to amend the second version with a lookup table for smaller values of N to improve the accuracy there. Fortunately, the estimation function
Math.log(N) + dEulerMascheroni
approaches an error value of 0 as N gets higher, so a lookup table only for lower values should suffice (haven't tried this yet).

Am I missing a really simple and fast way to generate this easy distribution accurately?

A lookup table with interpolation is possible, but I'd like to avoid it due to accuracy concerns.

Thanks for any advice...
  • 1. Re: trouble generating random numbers with a certain distribution
    807589 Newbie
    Currently Being Moderated
    phineasgage wrote:
    2. ...the fast but inaccurate way that I made after recognizing that the inverse of the cumulative probability distribution (forgive me if my math terms are wrong, which is likely) looks like the [harmonic series|http://en.wikipedia.org/wiki/Harmonic_series_(mathematics)], or 1/1 + 1/2 + 1/3 + ... for which there is an estimation using the [Euler-Mascheroni constant|http://en.wikipedia.org/wiki/Euler-Mascheroni_constant]:
    This formula just estimates a harmonic sum. Can't you use the exact sum?

    Say N=100. You create a table with 100 doubles. The first element holds 1, the second 1/2, the third 1/3, up to the 100'th element which holds 1/100. You then sum up all elements. In this case you get H100 = 5.18737. You then divide all elements in the array with this value.

    The above array holds the wanted probability function. If you now generate a random uniformly distributed double between 0 and 1 you can use the array to decide which element (number between 1 and 100) was selected.
  • 2. Re: trouble generating random numbers with a certain distribution
    807589 Newbie
    Currently Being Moderated
    Thanks for the reply. It's true that I could use this method for small numbers of N, but ultimately I'll run my code to N=10^9, and possibly higher.

    Perhaps I could use this table method to around N=10,000 (10k doubles in memory isn't too bad), then somehow scale it and interpolate to whatever size I want. But I've yet to determine how the error will affect my final simulation. Fortunately, the error rapidly gets smaller and smaller as you go higher, as the harmonic sum grows incredibly slowly...so linear interpolation between pre-calculated values may suffice after some point, but will have to check this out more...
  • 3. Re: trouble generating random numbers with a certain distribution
    807589 Newbie
    Currently Being Moderated
    phineasgage wrote:
    Thanks for the reply. It's true that I could use this method for small numbers of N, but ultimately I'll run my code to N=10^9, and possibly higher.

    Perhaps I could use this table method to around N=10,000 (10k doubles in memory isn't too bad), then somehow scale it and interpolate to whatever size I want. But I've yet to determine how the error will affect my final simulation. Fortunately, the error rapidly gets smaller and smaller as you go higher, as the harmonic sum grows incredibly slowly...so linear interpolation between pre-calculated values may suffice after some point, but will have to check this out more...
    The method I suggested can be improved somewhat if, instead of the probability function, the distribution function is used. The array will then be sorted so you can use a binary search. With N=10000 you'll find the value in like 30 accesses or so.

    I didn't realize you had N's of that order of magnitude. But doesn't the probability for N= 10^9 become so small that it's effectively zero compared with the smaller numbers, like say 1?

    I've had a look in a statistic's handbook but I cannot find any reference to your "harmonic" distribution. It should be classified as a "heavy tail" or "fat tail" distribution I think,

    [http://en.wikipedia.org/wiki/Heavy_tail|http://en.wikipedia.org/wiki/Heavy_tail]

    if that can be of any help.

    I'm kind of interested in "heavy tail" probability functions so I was wondering what application you're programming and how this "harmonic" distribution occurs in it?
  • 4. Re: trouble generating random numbers with a certain distribution
    807589 Newbie
    Currently Being Moderated
    uj_ wrote:
    I've had a look in a statistic's handbook but I cannot find any reference to your "harmonic" distribution. It should be classified as a "heavy tail" or "fat tail" distribution I think,
    [http://en.wikipedia.org/wiki/Harmonic_series_(mathematics)|http://en.wikipedia.org/wiki/Harmonic_series_(mathematics)]
  • 5. Re: trouble generating random numbers with a certain distribution
    807589 Newbie
    Currently Being Moderated
    Encephalopathic wrote:
    uj_ wrote:
    I've had a look in a statistic's handbook but I cannot find any reference to your "harmonic" distribution. It should be classified as a "heavy tail" or "fat tail" distribution I think,
    [http://en.wikipedia.org/wiki/Harmonic_series_(mathematics)|http://en.wikipedia.org/wiki/Harmonic_series_(mathematics)]
    Yes, yes. An harmonic series right.
  • 6. Re: trouble generating random numbers with a certain distribution
    807589 Newbie
    Currently Being Moderated
    uj_ wrote:
    Encephalopathic wrote:
    uj_ wrote:
    I've had a look in a statistic's handbook but I cannot find any reference to your "harmonic" distribution. It should be classified as a "heavy tail" or "fat tail" distribution I think,
    [http://en.wikipedia.org/wiki/Harmonic_series_(mathematics)|http://en.wikipedia.org/wiki/Harmonic_series_(mathematics)]
    Yes, yes. An harmonic series right.
    I don't think it's called a "harmonic distribution" (I looked everywhere guessing that that's what it was, but couldn't find reference to it either). I think the harmonic series only happens to describe the cumulative distribution function that I'm looking for.

    I had not heard of the heavy tail or fat tail distributions (apparently this is also called a power law distribution, which I'd seen somewhere in passing), and don't understand their math yet, but this looks as though it could be promising. According to this, it looks like this could be my cumulative distribution function, with k=-1 and alpha=1 (but then k doesn't describe the smallest value the random variable can take so I'm not sure I've got this right). I'll post with how it goes with that one...

    The application is a personal experiment with small-world networks. I'm trying to reproduce the results of Jon Kleinberg in [http://www.cs.cornell.edu/home/kleinber/swn.pdf], however unlike in the paper, I would like to use a one dimensional rather than a two dimensional network. According to the paper, a distribution of random links which is inversely proportional to the distance between the nodes that the links connect should produce optimal path lengths with greedy routing (page 4 talks about the distribution). So, I think I want 1/2 as many links of distance 2 as distance 1, 1/3 as many links of distance 3 as distance 1, etc. I'd like to simulate small to very large networks to look at the average path length. So far I tried using an exponential distribution, but I don't think it's optimal, as if I tweak its parameters it does better or worse at various sizes of networks. I think the exponential distribution is best for networks in two dimensions.
  • 7. Re: trouble generating random numbers with a certain distribution
    807589 Newbie
    Currently Being Moderated
    phineasgage wrote:
    According to the paper, a distribution of random links which is inversely proportional to the distance between the nodes that the links connect should produce optimal path lengths with greedy routing (page 4 talks about the distribution).
    Well, if one uses a continous approach (rather than discrete) it's much easier.

    The probability function is k*1/x, where x goes from 1 to N and k is a constant which normalizes the probability mass to 1 (required for a probability function)

    To get k you integrate 1/x from 1 to N. The primitive function of 1/x is ln(x) so you get ln(N) - ln(1) = ln(N) and k becomes 1/ln(N). The probability function becomes,

    P(x) = 1/ln(N) * 1/x

    Now to get the distribution function you integrate P(x) from 1 to a variable, say y. You get

    F(y) = ln(y)/ln(N)

    This is a proper distribution function. For y=1 it is 0 and then it grows monotonously to 1 when y=N.

    Now to simulate the F random distribution one can use the Inverse Transformation Method. It states that

    X = F_inverse(U)

    where U is a uniform random number between 0.0 and 1.0. If you plug a U into F_inverse you get X which is a random number from the F distribution.

    To invert F you set

    U = F(X) = ln(X)/lnN

    and solve for X. This gives you

    X = e^[ln(N)*U]

    Now first select the N constant. Then generate uniform U's (between 0.0 and 1.0) using the standard Java random generator and plug them into the above formula and you'll get random numbers from the wanted F distribution.

    If U=0.0 you get X=1 and if U=1.0 you get X=N. So all random numbers will fall within the 1 to N interval as they should.

    If you plug in U=0.5 you should get the average X which is e^[ln(N)*0.5]. With N=10 this becomes 3.16. Intuitively this looks about right. The average should be shifted towards the left because it's there you find the bulk of the probability mass of F.

    This gives you continous random numbers. If you want a discrete approxomation you should round down to an integer (like floor). But distances should be continous so you really have no need for that.

    I cannot 100% guarantee that the above is correct but it seems quite reasonable. If you decide to use it please first verify it experimentally.
  • 8. Re: trouble generating random numbers with a certain distribution
    807589 Newbie
    Currently Being Moderated
    uj_ wrote:
    Well, if one uses a continous approach (rather than discrete) it's much easier.

    The probability function is k*1/x, where x goes from 1 to N and k is a constant which normalizes the probability mass to 1 (required for a probability function)

    To get k you integrate 1/x from 1 to N. The primitive function of 1/x is ln(x) so you get ln(N) - ln(1) = ln(N) and k becomes 1/ln(N). The probability function becomes,

    P(x) = 1/ln(N) * 1/x ...
    I put this into code and it works great, as a continuous distribution. The code is as follows:
    Random rand = new Random();
    int n = 100;
    double dU = rand.nextDouble();
    double dX = Math.exp(Math.log(n) * dU); // dX is the random number on the distribution
    Much appreciation for the math and explanation. One thing I understand better now is that I wasn't creating a true probability function (one whose integral in the desired range is 1.0). I have one remaining challenge, which is to get the discrete distribution correct. If I put the results (of 1,000,000 trials) into a histogram I get this (where the first value in each line is the number, and the second was how many times it was obtained):

    1.0,149960
    2.0,88015
    3.0,62621
    4.0,48360
    5.0,39490

    Although the value 1.0 occurs twice as often as the value 2.0, the values between 1 and 2 do not occur twice as often as the values between 2 and 3, so taking the floor of the result yields the above. Using Math.round() does not help the situation, so I'm not sure yet how to take this continuous distribution and make it discrete for my range 1 - n.

    I did get my hybrid table / Euler approximation method working, and the distribution is close (though not perfect either, as I'll explain in another post).
  • 9. Re: trouble generating random numbers with a certain distribution
    807589 Newbie
    Currently Being Moderated
    phineasgage wrote:
    Although the value 1.0 occurs twice as often as the value 2.0, the values between 1 and 2 do not occur twice as often as the values between 2 and 3, so taking the floor of the result yields the above.
    It would be a very poor random number generator if a sample of 1,000,000 values yielded the exact theoretical distribution. It it normal to use the chi-squared test to test a distribution against it's theoretical distribution.
  • 10. Re: trouble generating random numbers with a certain distribution
    807589 Newbie
    Currently Being Moderated
    Following is the (partially flawed) code for a discrete random number generator that generates a number from 1 to n with the probability distribution 1/x. It does this using a lookup table for low values, and an approximation for the harmonic series for higher values that uses the Euler-Mascheroni constant.

    I should state first that if I can get the much more elegant continuous method posted by uj_ working for my discrete distribution needs, it will be much faster and cleaner. The method posted here works, but suffers from two problems:

    1. There is a discontinuity when switching between the lookup table and the Euler approximation. At the point of the discontinuity you'll get one value with a frequency about half of what it should be.

    2. Because the approximation with the Euler-Mascheroni constant produces an area that it slightly less than actual, the distribution ends earlier than it should. For example, when n is 100 and the table size is 10, the frequency of values from 95 and above are invalid.

    Higher table sizes mitigate both problems at the expense of memory consumption.
    Random rand = new Random();
    double dEulerMascheroni = 0.57721566490153286060651209008240243104215933593992;
    long n = 100;
    int iTableSize = 10;
    double[] dLookupTable = new double[iTableSize];
        
    double dLookupCurrent = 0.0;
    for(int i = 1; i < iTableSize; i++)
    {
        dLookupCurrent += (1.0 / (double) i);
        dLookupTable[i] = dLookupCurrent;
    }
    double dTotal = dLookupCurrent + ((Math.log(n - 1) + dEulerMascheroni) -
        (Math.log(iTableSize) + dEulerMascheroni));
    for(int i = 0; i < iTableSize; i++)
        dLookupTable[i] /= dTotal;
    
    double dLookupCrossoverY = dLookupTable[iTableSize-1];
    
    double dRand = rand.nextDouble();
    long lX;
    
    if (dRand < dLookupCrossoverY) // use lookup table
    {
        lX = (long) Math.exp((dTotal * dRand) - dEulerMascheroni);
        boolean bFound = false;
        while (!bFound)
        {
            int iX = (int) lX; // will fit into an int because it's in the lookup table
            double dCurrentY = dLookupTable[iX];
            if (dRand >= dLookupTable[iX])
            {
                if (iX >= (iTableSize - 1))
                    bFound = true;
                else if (dRand < dLookupTable[iX + 1])
                    bFound = true;
                else
                    lX++;
            }
            else
            {
                if (iX <= 0)
                    bFound = true;
                else
                    lX--;
            }
        }
    }
    else // use estimation with Euler-Mascheroni constant
    {
        lX = Math.exp((dTotal * dRand) - dEulerMascheroni);
    }
    // lX +1 is a number between 1 and n with desired distribution
  • 11. Re: trouble generating random numbers with a certain distribution
    807589 Newbie
    Currently Being Moderated
    phineasgage wrote:
    Much appreciation for the math and explanation. One thing I understand better now is that I wasn't creating a true probability function (one whose integral in the desired range is 1.0). I have one remaining challenge, which is to get the discrete distribution correct. If I put the results (of 1,000,000 trials) into a histogram I get this (where the first value in each line is the number, and the second was how many times it was obtained):

    1.0,149960
    2.0,88015
    3.0,62621
    4.0,48360
    5.0,39490

    Although the value 1.0 occurs twice as often as the value 2.0, the values between 1 and 2 do not occur twice as often as the values between 2 and 3, so taking the floor of the result yields the above. Using Math.round() does not help the situation, so I'm not sure yet how to take this continuous distribution and make it discrete for my range 1 - n.
    Just to check that we agree on the prerequise. If 1 occurs with probability p, then 2 should occur with p/2, and 3 with p/3, and 4 with p/4 right?

    Unfortunately I don't think the continuous distribution can be used to get an exact discrete one. The resaon I derived the continous one was because it seemed simpler and because you mentioned distances between nodes. If those distances are continous in the 1 to N interval it's the continous distribution you want.
  • 12. Re: trouble generating random numbers with a certain distribution
    807589 Newbie
    Currently Being Moderated
    uj_ wrote:
    phineasgage wrote:
    Much appreciation for the math and explanation. One thing I understand better now is that I wasn't creating a true probability function (one whose integral in the desired range is 1.0). I have one remaining challenge, which is to get the discrete distribution correct. If I put the results (of 1,000,000 trials) into a histogram I get this (where the first value in each line is the number, and the second was how many times it was obtained):

    1.0,149960
    2.0,88015
    3.0,62621
    4.0,48360
    5.0,39490

    Although the value 1.0 occurs twice as often as the value 2.0, the values between 1 and 2 do not occur twice as often as the values between 2 and 3, so taking the floor of the result yields the above. Using Math.round() does not help the situation, so I'm not sure yet how to take this continuous distribution and make it discrete for my range 1 - n.
    Just to check that we agree on the prerequise. If 1 occurs with probability p, then 2 should occur with p/2, and 3 with p/3, and 4 with p/4 right?

    Unfortunately I don't think the continuous distribution can be used to get an exact discrete one. The resaon I derived the continous one was because it seemed simpler and because you mentioned distances between nodes. If those distances are continous in the 1 to N interval it's the continous distribution you want.
    The prerequisite as you described it is correct, but the values should ideally be discrete (though as described below, it's probably close enough for this experiment).

    I'm implementing the network as a ring where nodes have an address as a long, and the distance between them is defined as the shortest difference in their addresses around the circle (wrapping around from node N to node 1). Node addresses are exactly from 1 to N, with no gaps, so I can do the simulation without creating an object for each node, but just route around the numbers in the ring and generate links as needed.

    Incidentally, the distribution you've given me and the one with the table and approximation are already giving me results. It's interesting to experiment with the percentage of random links that are made, and how that affects the results. The continuous distribution, though it may not be exact for the discrete case, produces path lengths that are within about a 1% difference (rigorous statistics aside) from those produced by the acceptance-rejection method. For a network with 100,000 nodes, with greedy routing, average path lengths when adding one random link per node using this distribution are around 42. My next task is to try a distribution of 1/x^r, where I can adjust and optimize r, but I understand enough at this point to do that one on my own.

    I'll close the question at this point as it's quite close, and the continuous distribution works well. If someone knows of an ideal method for the discrete case in the future, it would be interesting. Thanks to sabre150 for the idea about chi-squared as well. I'll use this in the future, and trust it over eyeballing the histogram.
  • 13. Re: trouble generating random numbers with a certain distribution
    807589 Newbie
    Currently Being Moderated
    The continuous distribution from uj_ works great. Although it isn't perfect in the discrete case, it can still be used if the size of N is large and the application doesn't demand perfect accuracy in the lower range.
  • 14. Re: trouble generating random numbers with a certain distribution
    807589 Newbie
    Currently Being Moderated
    phineasgage wrote:
    The continuous distribution from uj_ works great. Although it isn't perfect in the discrete case, it can still be used if the size of N is large and the application doesn't demand perfect accuracy in the lower range.
    I haven't been doing calculus for a while but I suddently realized that this,

    X = e^[ln(N)*U]

    can be further reduced to,

    X = [e^[ln(N)]]^U = N^U

    This means that the pow method can be used for the simulation code, like
    Random rand = new Random();
    int n = 100;
    double dU = rand.nextDouble();
    double dX = Math.pow(n,  dU); // dX is the random number on the distribution
    Considering that this distribution is so easy to simulate (in the continous case) it's quite astonishing that there are so few references to it. I haven't found a single one.

    And the distribution is interesting. Every number in the range 1 to N comes up with a probability that's proportional to the inverse of the number. Informally this corresponds to the notion that "the shorter/smaller/sooner the more probable, the longer/larger/later the less probable". This should be a common situation in "nature". Doesn't stuff tend to split up in many small and few large? Doesn't people prefer sooner rather than later?

    Further the distribution has a "fat tail". This means that numbers far away from the average still has a substantial probability of occuring. This is not the case in the Gaussian/Normal distribution for example where probabilities quickly get extremely small (like not even once in the lifetime of the universe).

    The distribution also is mathematically truncated. It's limited to a specific range. Again, this is not the case with the Gaussian/Normal distribution. It must be artificially truncated in principle skewing it.

    Now what should this good cigarr be called? Well I suggest the Truncated Harmonic Distribution until someone manages to Google up its given name -:)
1 2 Previous Next