I wonder whether shuf handles very large input efficiently. Reservoir sampling wouldn't need to keep the whole input in memory, which could be an advantage, but I don't know how shuf works.
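For what it's worth, here's a minimal sketch of reservoir sampling for a single line (the n=1 case) in awk; this is just the technique, not a claim about how shuf is implemented. Each line i overwrites the kept line with probability 1/i, so every line is equally likely to survive and memory use stays constant no matter how long the input is:

    awk 'BEGIN { srand() }                    # seed the RNG
         { if (rand() * NR < 1) keep = $0 }   # line NR survives with probability 1/NR
         END { print keep }'

Generalizing to -n K just means keeping an array of K lines and, for each later line, replacing a uniformly random slot with probability K/NR.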
Doesn't look like it. I just tried running "yes | shuf -n 1" (using the latest version of GNU coreutils, 8.20) and its memory consumption increased steadily until I killed it.
This seems like a really useful improvement, and I'm surprised it hasn't already been requested on the coreutils issue tracker.
sort -R isn't actually a random permutation or shuffle like one would ordinarily expect from 'random' output: if you read the man page very carefully, you'll notice it only speaks of a 'random hash', which means that two identical lines (quite possible and common) will hash to the same value and be placed consecutively in the output.
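Easy to demonstrate with GNU sort. Which group comes first varies from run to run, but the duplicates always end up adjacent, since equal lines hash to the same key; one run might produce:

    $ printf 'a\nb\na\nb\n' | sort -R
    b
    b
    a
    a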
(This bit me once. Filed a bug about maybe clarifying the man page, but nope, apparently we're all supposed to recognize instantly the implication of a random hash.)