Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

uhhh. You mean like shuf -n NNNN ?


I wonder if the implementation of shuf would handle very large input efficiently? Reservoir sampling wouldn't need to keep the whole input in memory, which could be an advantage. But I don't know how shuf works.


Doesn't look like it. I just tried running "yes | shuf -n 1" (using the latest version of GNU coreutils, 8.20) and its memory consumption increased steadily until I killed it.

It seems like this would be a really useful improvement, and I'm surprised that it doesn't already seem to have been requested on the coreutils issue tracker.


Agreed it would be.

So, let's pursue this: http://lists.gnu.org/archive/html/coreutils/2012-11/msg00079...


did you try "yes | dimsum -n 1"?

In my hands, `top` shows resident memory increasing steadily too....

It is perhaps more instructive to compare output from, for example

seq 1 1000000 | valgrind --time-unit=B --pages-as-heap=yes --trace-children=yes --tool=massif --massif-out-file=massif.dimsum.100000.out.%p dimsum -n 1

with

seq 1 1000000 | valgrind --time-unit=B --pages-as-heap=yes --trace-children=yes --tool=massif --massif-out-file=massif.shuf.100000.out.%p shuf -n 1

in my hands, shuf is faster and uses less memory for this task.

How about you?


sigh, memory leak. It's fixed in github. When camilo is around I'll get him to update the gem


thanks - looking forward to the patch


try a `gem update`. Memory performance should be much better now but I'm still curious about speed


or "sort -R"...


sort -R isn't actually a random permutation or shuffle like one would ordinarily expect from a 'random' output: if you read the man page very carefully, you'll notice it only speaks of 'random hash' which means that 2 lines which are identical (quite possible and common) will be hashed to the same value and placed consecutively in the output.

(This bit me once. Filed a bug about maybe clarifying the man page, but nope, apparently we're all supposed to recognize instantly the implication of a random hash.)




Consider applying for YC's Summer 2026 batch! Applications are open till May 4

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: