Peeter Joot's (OLD) Blog.

Math, physics, perl, and programming obscurity.

Posts Tagged ‘hash key’

comparing some times for perl vs sort command line hacking

Posted by peeterjoot on June 22, 2010

I had a 2M line file that contained among other things function identifier strings such as:

SAL_MANAGEMENT_PORT_HANDLE::SAL_ManagementGetServerRole
SAL_MANAGEMENT_PORT_HANDLE::SAL_ManagementHandleClose
SAL_MANAGEMENT_PORT_HANDLE::SAL_ManagementHandleOpen

I wanted to extract just these and sort them by name for something else. I’d first tried this in vim, but it was taking too long. Eventually I control-C’ed it and realized I had to be a bit smarter about it. I figured something like perl would do the trick, and I was able to extract those strings easily with:

cat flw.* | perl -p -e 's/.*?(\S+::\S+).*/$1/;'

(ie: grab just the not-space::not-space text and spit it out). passing this to ‘sort -u’ was also taking quite a while. Here’s a slightly smarter way to do it, still also a one-liner:

cat flw.* | perl -n -e 's/.*?(\S+::\S+).*/$h{$1}=1/e; END{ foreach (sort keys %h) { print "$_\n" ; } } '

All the duplicates are automatically discarded by inserting the matched value into a hash instead of just spitting it out. Then a simple loop over the hash keys gives the result directly. For the data in question, this ended up reducing the time required for the whole operation to just 12.5seconds (eventually I ran the original ‘perl -… | sort -u’ in the background and found it would have taken 1.6 minutes). It took far less time to tweak the command line than the original command would have taken, and provides a nice example where an evaluated expression in the regex match can be handy.

Of course, I then lost my time savings by writing up these notes for posterity;)

Posted in perl and general scripting hackery | Tagged: , , , , | 4 Comments »