Berkeley CSUA MOTD:Entry 40247
2005/10/24-26 [Computer/SW/Languages/Perl] UID:40247 Activity:kinda low
10/24   Dear motd.  I need some perl to randomize the lines in a text file.
        I know this is easy but I have no perl-fu.  Please help.
        \_ @lines = <FILEHANDLE>;
           print splice(@lines,rand(@lines),1) while @lines;
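           A self-contained way to run that (taking the filename from @ARGV
           is an assumption here; the snippet above just says FILEHANDLE):
           #!/usr/bin/perl
           # slurp every line into memory, then repeatedly splice out a
           # random line and print it until the array is empty
           use strict;
           use warnings;
           die "usage: $0 file\n" if @ARGV != 1;
           open(my $fh, '<', $ARGV[0]) or die "can't open $ARGV[0]: $!";
           my @lines = <$fh>;
           close($fh);
           print splice(@lines, rand(@lines), 1) while @lines;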
           \_ Let's hope your file is not bigger than the amount of virtual
              memory of the machine.
              \_ Yes, since motd so often grows to fill memory on soda....
              \_ Let's... How completely pointless.  This was a 2 minute perl
                 snippet, fer chrissake...
            \_ Maybe I should show what I have and you can tell me where I'm
               going wrong.  This uses a Fisher-Yates shuffle to be unbiased:
              #!/usr/bin/perl
               open(FILE, "+< $_");
               while (<FILE>) {
                   push(@lines, $_);
               }
               while (@lines) {
                   print splice(@lines,rand(@lines)%@arraylength);
               }
              @reordered = fisher_yates_shuffle(@lines);
              foreach (@reordered) {
                  print $_;
              }
              sub fisher_yates_shuffle {
                  my $list = shift;  # this is an array reference
                  my $i = @{$list};
                  return unless $i;
                  while ( --$i ) {
                      my $j = int rand( $i + 1 );
                      @{$list}[$i,$j] = @{$list}[$j,$i];
                  }
              }
               \_ Your function shuffles through an array reference and doesn't
                  return the array, so @reordered just gets the return value of
                  the "while" rather than the shuffled lines.  Drop the leftover
                  splice loop, call it as fisher_yates_shuffle(\@lines), and
                  change "foreach(@reordered)" to "foreach(@lines)" and this
                  code should work (a corrected version is sketched below).
           \_ Careful about using rand with %.  You can get into distribution
              problems there.
              \_ The % doesn't do anything; perl rand called that way can
                 never return >= @lines.
                 \_ Upgraded as per dbushong.  I didn't trust perl enough.
           print splice(@lines,rand(@lines)%@arraylength) while (@lines);
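                  A quick sketch of the % point above (dummy data, purely
                  illustrative): rand(EXPR) returns a float in [0, EXPR),
                  so with the array in numeric context it is always below
                  scalar(@lines), and a trailing % @lines never changes it.
                  #!/usr/bin/perl
                  use strict;
                  use warnings;
                  my @lines = ('a' .. 'j');   # 10 dummy "lines"
                  for (1 .. 5) {
                      my $r = rand(@lines);   # e.g. 7.42316
                      printf "%.5f -> index %d (same with %%: %d)\n",
                             $r, int($r), int($r) % @lines;
                  }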
              \_ Where's $_ coming from in the open() line?
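               Putting the fixes above together (taking the filename from
               @ARGV rather than $_ is an assumption, since the original
               never sets $_), a corrected sketch of the whole script:
               #!/usr/bin/perl
               use strict;
               use warnings;
               die "usage: $0 file\n" if @ARGV != 1;
               open(my $fh, '<', $ARGV[0]) or die "can't open $ARGV[0]: $!";
               my @lines = <$fh>;
               close($fh);
               fisher_yates_shuffle(\@lines);   # shuffles @lines in place
               print @lines;
               sub fisher_yates_shuffle {
                   my $list = shift;            # an array reference
                   my $i = @{$list};
                   return unless $i;
                   while ( --$i ) {
                       my $j = int rand( $i + 1 );
                       @{$list}[$i,$j] = @{$list}[$j,$i];
                   }
               }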
        \_ #!/usr/bin/perl
           die "usage: $0 file\n" if @ARGV != 1;
           open(my $fh, '<', $ARGV[0]);
            # record the byte offset of the start of every line
            my @offsets = (0);
            push(@offsets, tell($fh)) while <$fh>;
            pop @offsets;  # the final tell() is EOF, not a line start
            # pull a random offset out each time, seek there, print the line
            while (@offsets) {
             seek($fh, splice(@offsets, rand(@offsets), 1), 0);
             print scalar <$fh>;
           }
           close($fh);
           ## this is how i'd do it.  --dbushong
           \_ My (extremely short) version:
              /msg dbushong hey, can you write a solution to that motd thing?
           \- i have had problems using perl rand to do this on files with
              more than 32k lines. you may want to test this out ... maybe
               rand returns more values than it used to but i had to re-write
              this for larger files ... this was +5yrs ago. if you want the
              codes mail me. oh also for larger files performance can be an
              issue. [i dont mean really large files ... i typically was
               operating on about 130k entries ... 2 x /16 netblocks of addresses]
                                                        --psb
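               A rough way to probe the rand granularity being described
               (an illustrative sketch, not psb's code; 32k matches the
               classic 15-bit rand limit on some older systems):
               #!/usr/bin/perl
               use strict;
               use warnings;
               # with a full-range rand, ~2e6 draws should hit essentially
               # all 100000 buckets; a 15-bit generator can never produce
               # more than 32768 distinct values
               my %seen;
               $seen{ int rand(100_000) } = 1 for 1 .. 2_000_000;
               printf "distinct values seen: %d\n", scalar keys %seen;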
            \_ Actually, it looks like it's not perl rand; it's that splicing
               entries out of a large array isn't efficient, and that's
               making things suck.  I'll ponder.  --dbushong
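               A rough benchmark sketch of that effect (the array size is
               arbitrary): every splice() has to shift the tail of the
               array, so the splice loop is O(n^2), while an in-place
               fisher-yates swap pass over the same data is O(n).
               #!/usr/bin/perl
               use strict;
               use warnings;
               use Benchmark qw(timethese);
               my $n = 50_000;
               timethese(3, {
                   splice_loop  => sub {
                       my @a = (1 .. $n);
                       my @out;
                       push @out, splice(@a, rand(@a), 1) while @a;
                   },
                   fisher_yates => sub {
                       my @a = (1 .. $n);
                       for (my $i = $#a; $i > 0; $i--) {
                           my $j = int rand($i + 1);
                           @a[$i, $j] = @a[$j, $i];
                       }
                   },
               });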
            \_ OK, redone to use fisher-yates as above.  Now it only takes
              22 seconds on soda on /usr/share/dict/words:  --dbushong
              #!/usr/bin/perl
              die "usage: $0 file\n" if @ARGV != 1;
              open(my $fh, '<', $ARGV[0]);
               # record the byte offset of the start of every line
               my @offsets = (0);
               push(@offsets, tell($fh)) while <$fh>;
               pop @offsets;  # the final tell() is EOF, not a line start
               # in-place fisher-yates shuffle of the offsets
               for (my $i = @offsets - 1; $i > 0; $i--) {
                 my $j = int(rand($i + 1));  # j in 0..$i keeps it unbiased
                 @offsets[$i,$j] = @offsets[$j,$i] if $i != $j;
               }
              for (@offsets) {
                seek($fh, $_, 0);
                print scalar <$fh>;
              }
              close($fh);
               \- hello my codes take about 5-6 sec on /usr/dict/words
                 on sloda but the sloda numbers are not that stable
                 it is interesting to see the memory growth variations
                 of the different approaches. ok tnx.
                 this time i didnt check the quality of the shuffle.
                 SSH-soda{12}[~/bin]% while 1
                  loop==>  ./rand1.pl /usr/share/dict/words > /dev/null
                  loop==>  end
                 0:05.46sec, [3.961u 0.100s 74.3%], [10080Kbmax 0pf+0#swap]
                 0:06.56sec, [3.949u 0.146s 62.1%], [10078Kbmax 0pf+0#swap]
                 0:05.42sec, [3.953u 0.108s 74.7%], [10080Kbmax 0pf+0#swap]
                 0:06.70sec, [3.921u 0.172s 61.0%], [10082Kbmax 0pf+0#swap]
                 0:08.29sec, [4.041u 0.182s 50.9%], [10074Kbmax 0pf+0#swap]
                 0:05.19sec, [3.870u 0.185s 78.0%], [10074Kbmax 0pf+0#swap]
                 0:04.79sec, [3.830u 0.176s 83.5%], [10078Kbmax 0pf+0#swap]
                 0:04.55sec, [3.902u 0.159s 89.0%], [10074Kbmax 0pf+0#swap]
                 0:06.07sec, [3.917u 0.182s 67.3%], [10076Kbmax 0pf+0#swap]
                 \_ How would an Intel Critical Asset randomize a file?