I had a 2M-line file that contained, among other things, function identifier strings such as:
SAL_MANAGEMENT_PORT_HANDLE::SAL_ManagementGetServerRole
SAL_MANAGEMENT_PORT_HANDLE::SAL_ManagementHandleClose
SAL_MANAGEMENT_PORT_HANDLE::SAL_ManagementHandleOpen
I wanted to extract just these and sort them by name for something else. I’d first tried this in vim, but it was taking too long. Eventually I control-C’ed it and realized I had to be a bit smarter about it. I figured something like perl would do the trick, and I was able to extract those strings easily with:
cat flw.* | perl -p -e 's/.*?(\S+::\S+).*/$1/;'
(i.e., grab just the not-space::not-space text and spit it out). Passing this to ‘sort -u’ was also taking quite a while. Here’s a slightly smarter way to do it, still a one-liner:
cat flw.* | perl -n -e 's/.*?(\S+::\S+).*/$h{$1}=1/e; END{ foreach (sort keys %h) { print "$_\n" ; } } '
All the duplicates are automatically discarded by inserting each matched value into a hash instead of just spitting it out; a simple loop over the hash keys then gives the result directly. For the data in question, this reduced the time for the whole operation to just 12.5 seconds (I eventually ran the original ‘perl -… | sort -u’ in the background and found it would have taken 1.6 minutes). It took far less time to tweak the command line than the original command would have taken to finish, and it’s a nice example of where an evaluated expression (the /e modifier) in the substitution can be handy.
Of course, I then lost my time savings by writing up these notes for posterity ;)