python - Random is barely random at all? -
i did test randomness of randint:
>>> random import randint >>> >>> uniques = [] >>> in range(4500): # can see optimistic. ... x = randint(500, 5000) ... if x in uniques: ... raise exception('we duped %d @ iteration number %d' % (x, i)) ... uniques.append(x) ... traceback (most recent call last): file "<stdin>", line 4, in <module> exception: duped 887 @ iteration number 7
i tried 10 times more , best result got 121 iterations before repeater. best sort of result can standard library?
the birthday paradox, or why prngs produce duplicates more might think.
there couple of issues @ play in op's problem. 1 birthday paradox mentioned above , second nature of generating, not inherently guarantee given number not repeated.
the birthday paradox applies given value can occur more once during period of generator - , therefore duplicates can happen within sample of values. effect of birthday paradox real likelihood of getting such duplicates quite significant , average period between them smaller 1 might otherwise have thought. dissonance between percived , actual probabilities makes birthday paradox example example of cognitive bias, naive intuitive estimate wildly wrong.
a quick primer on pseudo random number generators (prngs)
the first part of problem taking exposed value of random number generator , converting smaller number, space of possible values reduced. although pseudo-random number generators not repeat values during period transformation changes domain smaller one. smaller domain invalidates 'no repeats' condition can expect significant likelihood of repeats.
some algorithms, such linear congruential prng (a'=ax|m
) do guarantee uniqueness entire period. in lcg generated value contains entire state of accumulator , no additional state held. generator deterministic , cannot repeat number within period - given accumulator value can imply 1 possible successive value. therefore, each value can occur once within period of generator. however, period of such prng relatively small - 2^30 typical implementations of lcg algorithm - , cannot possibly larger number of distinct values.
not prng algorithms share characteristic; can repeat given value within period. in op's problem mersenne twister algorithm (used in python's random module) has long period - greater 2^32. unlike linear congruential prng, result not purely function of previous output value accumulator contains additional state. 32 bit integer output , period of ~2^19937 cannot possibly provide such guarantee.
the mersenne twister popular algorithm prngs because has statistical , geometric properties , long period - desirable characteristics prng used on simulation models.
good statistical properties mean numbers generated algorithm evenly distributed no numbers having higher probability of appearing others. poor statistical properties produce unwanted skew in results.
good geometric properies mean sets of n numbers not lie on hyperplane in n dimensional space. poor geometric properties can generate spurious correlations in simulation model , distort results.
a long period means can generate lot of numbers before sequence wraps around start. if model needs large number of iterations or has run several seeds 2^30 or discrete numbers available typical lcg implementation may not sufficient. mt19337 algorithm has long period - 2^19337-1, or 10^5821. comparison total number of atoms in universe estimated @ 10^80.
the 32 bit integer produced mt19337 prng cannot possibly represent enough discrete values avoid repeating during such large period. in case duplicate values occur , inevitable large enough sample.
the birthday paradox in nutshell
this problem defined probability of 2 people in room sharing same birthday. key point any two people in room share birthday. people tend naively misinterpret problem probability of in room sharing birthday specific individual, source of cognitive bias causes people underestimate probability. incorrect assumption - there no requirement match to specific individual , 2 individuals match.
the probability of match occurring between 2 individuals higher probability of match specific individual match not have to specific date. rather, have find 2 individuals share same birthday. graph (which can found on wikipedia page on subject), can see need 23 people in room there 50% chance of finding 2 match in way.
from wikipedia entry on subject can nice summary. in op's problem have 4,500 possible 'birthdays', rather 365. given number of random values generated (equating 'people') want know probability of any 2 identical values appearing within sequence.
computing effect of birthday paradox on op's problem
for sequence of 100 numbers, have (100 * 99) / 2 = 4950 http://mathurl.com/yc4cdsr.png pairs (see understanding problem) potentially match (i.e. first match second, third etc., second match third, fourth etc. , on), number of combinations potentially match rather more 100.
from calculating probability expression of 1 - (4500! / (4500**100 * (4500 - 100)!) http://mathurl.com/ybfweny.png. following snippet of python code below naive evaluation of probability of matching pair occurring.
# === birthday.py =========================================== # math import log10, factorial pv=4500 # number of possible values ss=100 # sample size # these intermediate results exceedingly large numbers; # python automatically starts using bignums behind scenes. # numerator = factorial (pv) denominator = (pv ** ss) * factorial (pv - ss) # need bignums floats without intermediate # values large cast double. taking logs , # subtracting them equivalent division. # log_prob_no_pair = log10 (numerator) - log10 (denominator) # we've calculated log of probability *no* # 2 matching pairs occur in sample. probability # of @ least 1 collision 1.0 - probability no # matching pairs exist. # print 1.0 - (10 ** log_prob_no_pair)
this produces sensible looking result of p=0.669 match occuring within 100 numbers sampled population of 4500 possible values. (maybe verify , post comment if it's wrong). can see lengths of runs between matching numbers observed op seem quite reasonable.
footnote: using shuffling unique sequence of pseudo-random numbers
see this answer below s. mark means of getting guaranteed unique set of random numbers. technique poster refers takes array of numbers (which supply, can make them unique) , shuffles them random order. drawing numbers in sequence shuffled array give sequence of pseudo-random numbers guaranteed not repeat.
footnote: cryptographically secure prngs
the mt algorithm not cryptographically secure relatively easy infer internal state of generator observing sequence of numbers. other algorithms such blum blum shub used cryptographic applications, may unsuitable simulation or general random number applications. cryptographically secure prngs may expensive (perhaps requiring bignum calculations) or may not have geometric properties. in case of type of algorithm primary requirement should computationally infeasible infer internal state of generator observing sequence of values.
Comments
Post a Comment