Peeter Joot's (OLD) Blog.

Math, physics, perl, and programming obscurity.

comparing some times for perl vs sort command line hacking

Posted by peeterjoot on June 22, 2010

I had a 2M line file that contained among other things function identifier strings such as:


I wanted to extract just these and sort them by name for something else. I’d first tried this in vim, but it was taking too long. Eventually I control-C’ed it and realized I had to be a bit smarter about it. I figured something like perl would do the trick, and I was able to extract those strings easily with:

cat flw.* | perl -p -e 's/.*?(\S+::\S+).*/$1/;'

(i.e., grab just the not-space::not-space text and spit it out). Passing this to ‘sort -u’ was also taking quite a while. Here’s a slightly smarter way to do it, still a one-liner:

cat flw.* | perl -n -e 's/.*?(\S+::\S+).*/$h{$1}=1/e; END{ foreach (sort keys %h) { print "$_\n" ; } } '

All the duplicates are automatically discarded by inserting the matched value into a hash instead of just spitting it out. Then a simple loop over the sorted hash keys gives the result directly. For the data in question, this reduced the time required for the whole operation to just 12.5 seconds (eventually I ran the original ‘perl -… | sort -u’ in the background and found it would have taken 1.6 minutes). It took far less time to tweak the command line than the original command would have taken, and it provides a nice example where an evaluated replacement expression (the /e modifier on the substitution) can be handy.

Of course, I then lost my time savings by writing up these notes for posterity ;)

4 Responses to “comparing some times for perl vs sort command line hacking”

  1. poisonbit said

  2. poisonbit said

    Errr wikipedia syntax is very verbose:

    @sorted = map  { $_->[0] }
              sort { $a->[1] cmp $b->[1] }
              map  { [$_, foo($_)] }
                   @unsorted;

    And maybe something like:

    my @sorted = map $_->[0],
                 sort { $a->[1] cmp $b->[1] }
                 map [ $_, foo($_) ], @unsorted;

    And can use ” instead of ‘cmp’ id preferred.


  3. poisonbit said

    And can use ” instead of ‘cmp’ id preferred.

    ^^^ And can use <=> instead of cmp “if you prefer”.

  4. peeterjoot said

    I’m unclear what the relevance of that is? It seemed to me that the time consuming part was probably the collection of the large set of duplicates (ie: the fact that | sort had to process so much more).

    In this case I don’t have an @unsorted array, nor do I want one.
