java.util.Random rand = new java.util.Random();
int N = 100; // max int to generate
boolean bFound = false;
double dRand;
while (!bFound)
{
    int iRandX = (int) ((rand.nextDouble() * (double) N) + 1.0); // random X in 1..N
    double dRandY = rand.nextDouble() * (double) N; // random Y in [0, N)
    // only take the x value if y falls under the curve N / x
    if (dRandY <= ((double) N / (double) iRandX))
    {
        dRand = iRandX;
        bFound = true;
    }
}
However, it's dog slow as many samples get rejected (the acceptance rate is roughly ln(N)/N, about 5% for N = 100, and it shrinks as N grows).
java.util.Random rand = new java.util.Random();
double dEulerMascheroni = 0.57721566490153286060651209008240243104215933593992;
int N = 100;
double dMax = Math.log(N) + dEulerMascheroni;
double dNormalRand = rand.nextDouble() * (dMax - 1.0) + 1.0; // uniform in [1, ln(N) + gamma)
double dRand = Math.exp(dNormalRand - dEulerMascheroni); // invert H(x) ~ ln(x) + gamma
However, this is only an estimation, so after viewing a histogram, the frequencies of the smaller values in particular (1-10 or so) are wildly inaccurate, and for my application, accuracy in the lower values is critical. The error of the estimate Math.log(N) + dEulerMascheroni approaches 0 as N gets higher, so a lookup table for the lower values only should suffice (haven't tried this yet).
phineasgage wrote:This formula just estimates a harmonic sum. Can't you use the exact sum?
2. ...the fast but inaccurate way that I made after recognizing that the inverse of the cumulative probability distribution (forgive me if my math terms are wrong, which is likely) looks like the [harmonic series|http://en.wikipedia.org/wiki/Harmonic_series_(mathematics)], or 1/1 + 1/2 + 1/3 + ... for which there is an estimation using the [Euler-Mascheroni constant|http://en.wikipedia.org/wiki/Euler-Mascheroni_constant]:
phineasgage wrote:The method I suggested can be improved somewhat if, instead of the probability function, the distribution function is used. The array will then be sorted so you can use a binary search. With N=10000 you'll find the value in about 14 accesses (log2 of 10000).
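That suggestion (exact cumulative table plus binary search) might look something like this; a sketch only, with variable names of my own, not from the thread:

```java
import java.util.Arrays;
import java.util.Random;

// Build the exact discrete CDF for P(k) proportional to 1/k, k = 1..N,
// then invert a uniform draw with a binary search (~log2(N) accesses).
int N = 10000;
double[] cdf = new double[N];
double sum = 0.0;
for (int k = 1; k <= N; k++)
{
    sum += 1.0 / k;  // partial harmonic sum H_k
    cdf[k - 1] = sum;
}
for (int i = 0; i < N; i++)
    cdf[i] /= sum;   // normalize so cdf[N - 1] == 1.0

Random rand = new Random();
double u = rand.nextDouble();
int pos = Arrays.binarySearch(cdf, u);
// binarySearch returns -(insertionPoint) - 1 when the key is absent
int k = (pos >= 0) ? pos + 1 : -pos;
// k is now in 1..N with P(k) = (1/k) / H_N
```

Memory is the limiting factor here (N doubles), which is why this works up to N around 10,000 but not N = 10^9.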
Thanks for the reply. It's true that I could use this method for small numbers of N, but ultimately I'll run my code to N=10^9, and possibly higher.
Perhaps I could use this table method to around N=10,000 (10k doubles in memory isn't too bad), then somehow scale it and interpolate to whatever size I want. But I've yet to determine how the error will affect my final simulation. Fortunately, the error rapidly gets smaller and smaller as you go higher, as the harmonic sum grows incredibly slowly...so linear interpolation between pre-calculated values may suffice after some point, but will have to check this out more...
uj_ wrote:[http://en.wikipedia.org/wiki/Harmonic_series_(mathematics)|http://en.wikipedia.org/wiki/Harmonic_series_(mathematics)]
I've had a look in a statistics handbook but I cannot find any reference to your "harmonic" distribution. It should be classified as a "heavy tail" or "fat tail" distribution, I think.
Encephalopathic wrote:Yes, yes. A harmonic series, right. I don't think it's called a "harmonic distribution" (I looked everywhere guessing that that's what it was, but couldn't find a reference to it either). I think the harmonic series only happens to describe the cumulative distribution function that I'm looking for.
phineasgage wrote:Well, if one uses a continuous approach (rather than discrete) it's much easier.
According to the paper, a distribution of random links which is inversely proportional to the distance between the nodes that the links connect should produce optimal path lengths with greedy routing (page 4 talks about the distribution).
uj_ wrote:I put this into code and it works great, as a continuous distribution. The code is as follows:
The probability function is k*1/x, where x goes from 1 to N and k is a constant which normalizes the probability mass to 1 (required for a probability function).
To get k you integrate 1/x from 1 to N. The primitive function of 1/x is ln(x) so you get ln(N) - ln(1) = ln(N) and k becomes 1/ln(N). The probability function becomes,
P(x) = 1/ln(N) * 1/x ...
The cumulative distribution is then F(x) = ln(x)/ln(N). Setting F(x) = u for a uniform u in [0, 1) and solving gives x = exp(u * ln(N)) = N^u, which is what the code computes.
Random rand = new Random();
int n = 100;
double dU = rand.nextDouble();
double dX = Math.exp(Math.log(n) * dU); // dX is the random number on the distribution
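One quick sanity check on that sampler (my own addition, not from the thread): under a density proportional to 1/x, every doubling interval [a, 2a] carries the same probability ln(2)/ln(N), so the buckets [1,2) and [2,4) should collect roughly equal counts:

```java
import java.util.Random;

// Check the scale invariance of the continuous 1/x sampler: buckets
// [1,2) and [2,4) should each receive about ln(2)/ln(100) ~ 15% of samples.
Random rand = new Random(42); // fixed seed so the check is repeatable
int n = 100;
int trials = 1000000;
int low = 0, mid = 0;
for (int t = 0; t < trials; t++)
{
    double x = Math.exp(Math.log(n) * rand.nextDouble());
    if (x < 2.0) low++;
    else if (x < 4.0) mid++;
}
double ratio = (double) low / mid;
// ratio should come out very close to 1.0
```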
phineasgage wrote:It would be a very poor random number generator if a sample of 1,000,000 values yielded the exact theoretical distribution. It is normal to use the chi-squared test to test a distribution against its theoretical distribution.
Random rand = new Random();
double dEulerMascheroni = 0.57721566490153286060651209008240243104215933593992;
long n = 100;
int iTableSize = 10;
double[] dLookupTable = new double[iTableSize];
double dLookupCurrent = 0.0;
for (int i = 1; i < iTableSize; i++)
{
    dLookupCurrent += (1.0 / (double) i); // partial harmonic sum H_i
    dLookupTable[i] = dLookupCurrent;
}
double dTotal = dLookupCurrent + ((Math.log(n - 1) + dEulerMascheroni) -
                                  (Math.log(iTableSize) + dEulerMascheroni));
for (int i = 0; i < iTableSize; i++)
    dLookupTable[i] /= dTotal;
double dLookupCrossoverY = dLookupTable[iTableSize - 1];
double dRand = rand.nextDouble();
long lX;
if (dRand < dLookupCrossoverY) // use lookup table
{
    lX = (long) Math.exp((dTotal * dRand) - dEulerMascheroni); // initial guess
    if (lX < 0) lX = 0;                       // clamp the guess into the table
    if (lX > iTableSize - 1) lX = iTableSize - 1;
    boolean bFound = false;
    while (!bFound)
    {
        int iX = (int) lX; // fits into an int because it's in the lookup table
        if (dRand >= dLookupTable[iX])
        {
            if (iX >= (iTableSize - 1))
                bFound = true;
            else if (dRand < dLookupTable[iX + 1])
                bFound = true;
            else
                lX++;
        }
        else
        {
            if (iX <= 0)
                bFound = true;
            else
                lX--;
        }
    }
}
else // use estimation with Euler-Mascheroni constant
{
    lX = (long) Math.exp((dTotal * dRand) - dEulerMascheroni); // cast was missing
}
// lX + 1 is a number between 1 and n with the desired distribution
phineasgage wrote:Just to check that we agree on the prerequisite. If 1 occurs with probability p, then 2 should occur with p/2, and 3 with p/3, and 4 with p/4, right?
Much appreciation for the math and explanation. One thing I understand better now is that I wasn't creating a true probability function (one whose integral in the desired range is 1.0). I have one remaining challenge, which is to get the discrete distribution correct. If I put the results (of 1,000,000 trials) into a histogram I get this (where the first value in each line is the number, and the second was how many times it was obtained):
1.0,149960
2.0,88015
3.0,62621
4.0,48360
5.0,39490
Although the value 1.0 occurs twice as often as the value 2.0, the values between 1 and 2 do not occur twice as often as the values between 2 and 3, so taking the floor of the result yields the above. Using Math.round() does not help the situation, so I'm not sure yet how to take this continuous distribution and make it discrete for my range 1 - n.
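The skew has a clean explanation (my derivation, not from the thread): if X has density proportional to 1/x on [1, N], then P(floor(X) = k) = ln((k+1)/k) / ln(N), which is not proportional to the desired 1/k. That formula reproduces the histogram above almost exactly:

```java
// Predicted floor probabilities under the continuous 1/x density on [1, N]:
// P(floor(X) = k) = ln((k + 1) / k) / ln(N). Compare with the histogram:
// k = 1 predicts ~0.1505 (observed 0.150), k = 2 predicts ~0.0880 (observed 0.088).
int n = 100;
double norm = Math.log(n);
double pFloor1 = Math.log(2.0) / norm; // P(floor = 1)
double pFloor2 = Math.log(1.5) / norm; // P(floor = 2)
double floorRatio = pFloor1 / pFloor2; // ~1.71, not the desired 2.0
```

So flooring compresses the small values: the ratio between consecutive small outcomes is too small, exactly as the histogram shows.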
uj_ wrote:The prerequisite as you described it is correct, but the values should ideally be discrete (though as described below, it's probably close enough for this experiment).
Unfortunately I don't think the continuous distribution can be used to get an exact discrete one. The reason I derived the continuous one was because it seemed simpler and because you mentioned distances between nodes. If those distances are continuous in the 1 to N interval, it's the continuous distribution you want.
phineasgage wrote:I haven't been doing calculus for a while but I suddenly realized that this,
The continuous distribution from uj_ works great. Although it isn't perfect in the discrete case, it can still be used if the size of N is large and the application doesn't demand perfect accuracy in the lower range.
Random rand = new Random();
int n = 100;
double dU = rand.nextDouble();
double dX = Math.pow(n, dU); // dX is the random number on the distribution
Considering that this distribution is so easy to simulate (in the continuous case) it's quite astonishing that there are so few references to it. I haven't found a single one.