Peeter Joot's (OLD) Blog.

Math, physics, perl, and programming obscurity.

Posts Tagged ‘regular expression’

Have to love perl for quicky automated source changes.

Posted by peeterjoot on July 16, 2010

Looking at some code today of the following form:

   char buf[10] ;
   sprintf( buf, "%s ... ", somefunction() ) ;

Here somefunction returns a char *. This is very unsafe code, since a long enough string will overflow buf and leave you with all sorts of fun stack corruption to deal with. The pattern was repeated about 400 times in the modules in question, and it’s desirable to replace all of these with snprintf calls so that every write is bounded by the buffer size (in DB2 we use a different version of snprintf due to some portability issues, but the idea here is the same).

Here’s a nice little one liner to make the code changes required:

perl -p -i -e 's/\bsprintf *\( *(.*?), */snprintf( $1, sizeof($1), /' LIST_OF_FILENAMES

It’s not perfect, but it does the job nicely at the bulk of the call sites, changing the function name and adding the additional sizeof() parameter to the call. Of course a thorough review in context is still required, since you don’t want to take sizeof() of a char * argument and get the size of a pointer instead of the size of the buffer.


Posted in C/C++ development and debugging.

comparing some times for perl vs sort command line hacking

Posted by peeterjoot on June 22, 2010

I had a 2M line file that contained among other things function identifier strings such as:

SAL_MANAGEMENT_PORT_HANDLE::SAL_ManagementGetServerRole
SAL_MANAGEMENT_PORT_HANDLE::SAL_ManagementHandleClose
SAL_MANAGEMENT_PORT_HANDLE::SAL_ManagementHandleOpen

I wanted to extract just these and sort them by name for something else. I’d first tried this in vim, but it was taking too long. Eventually I control-C’ed it and realized I had to be a bit smarter about it. I figured something like perl would do the trick, and I was able to extract those strings easily with:

cat flw.* | perl -p -e 's/.*?(\S+::\S+).*/$1/;'

(ie: grab just the not-space::not-space text and spit it out). Passing this to ‘sort -u’ was also taking quite a while. Here’s a slightly smarter way to do it, still a one-liner:

cat flw.* | perl -n -e 's/.*?(\S+::\S+).*/$h{$1}=1/e; END{ foreach (sort keys %h) { print "$_\n" ; } } '

All the duplicates are automatically discarded by inserting the matched value into a hash instead of just spitting it out. Then a simple loop over the sorted hash keys gives the result directly. For the data in question, this ended up reducing the time required for the whole operation to just 12.5 seconds (eventually I ran the original ‘perl -… | sort -u’ in the background and found it would have taken 1.6 minutes). It took far less time to tweak the command line than the original command would have taken, and it provides a nice example of where an evaluated expression in the regex replacement can be handy.

Of course, I then lost my time savings by writing up these notes for posterity;)

Posted in perl and general scripting hackery

stripping color control characters from lssam output.

Posted by peeterjoot on March 17, 2010

There’s probably a billion ways to do this, but here’s one that appears to work, for when you have TSA command output that was collected by somebody who did not use the –nocolor option:

perl -p -e 's/\x1b\[\d+m//g;' < lssamBEFORE.out

The \x1b is the ESC character.  This says to remove that ESC followed by a ‘[‘ character, then 1 or more digits and the character ‘m’, and do it for all lines.

Posted in perl and general scripting hackery

A fun regular expression for the day. change all function calls to another.

Posted by peeterjoot on December 18, 2009

Hit some nasty old school code today that dates back to our one-time 16-bit OS/2 port. I figured out that 730 lines of code for an ancient function called sqlepost() could all be removed if I could make a change of all lines like so:

- sqlepost(SQLT_SQLE, SQLT_SQLE_SUBCOORD_TERM, 122, SQLE_EBAD_DB_ERR, sizeof(eRC), &eRC);
+ pdLog( PD_DEV, SQLT_SQLE_SUBCOORD_TERM, eRC, 122, PD_LEVEL_SEV, 0 ) ;

(83 places). A desirable side effect of making this change is that we will stop logging the return code as a byte reversed hex number, and instead log it as a return code. Easier on developers and system testers alike.

perl -p is once again a good friend for this sort of task. Here’s the replacement script:

s/sqlepost\s*\(
\s*(.*?)\s*, # componentID -- unused.
\s*(.*?)\s*, # functionID
\s*(.*?)\s*, # probe
\s*(.*?)\s*, # index -- unused.
\s*(.*?)\s*, # size -- unused.
\s*&(.*?)\s*\)\s*; # rc
/pdLog( PD_DEV, $2, $6, $3, PD_LEVEL_SEV, 0 ) ;/x ;

I made a quick manual modification of each of the call sites that weren’t all in one line, with control-J in vim to put the whole function call on one line, then just had to run:

perl -p -i ./replacementScript `cat listOfFilesWithTheseCalls`

Voila! Very nice, if I do say so myself ;)

EDIT: it was pointed out to me that the regular expressions used above are not entirely obvious.  Here’s a quick synopsis:

\s       any whitespace character (space, tab, newline)
.        any character (except newline, unless /s is used)
*        zero or more of the preceding
(.*)     capture an expression (creates $1, $2, ...)
            ie. zero or more of anything, as much as possible (greedy).
(.*?)    capture an expression, but don't be greedy, only capturing the
            minimal amount.
\(       a literal open parenthesis (escaped, so it does not start a capture)
\)       a literal close parenthesis.

Posted in C/C++ development and debugging., perl and general scripting hackery

extract a parameter from a function call, and add it as a parameter to another.

Posted by peeterjoot on September 15, 2009

Here’s a fun change using the same evaluated regular expression “template” used in a few previous posts.  The object is to find the start and end of each function body (knowing the coding conventions in force) and, if the body contains a call to a function of a certain name, extract the first parameter of that call and substitute it for the last parameter of all calls (if any) to two other functions.

I’ll let the script speak for itself, since it’s only slightly different from the ones detailed in previous posts. I’ve commented the regular expression used to find the function start/stop (using the /x modifier that allows spaces and comments in the expression).

#!/usr/bin/perl

while (<>)
{
   $p .= $_ ;
}

$p =~ s/
(           # begin capture
^\{         # open brace at the beginning of a line (function body start)
.*?         # other stuff but not greedy.
^\}         # close brace at the beginning of a line (function body end)
\s*$        # followed by nothing interesting (opt spaces then newline)
)           # end capture
/foo("$1")/smegx ;
print $p ;

exit ;

sub foo
{
   my $s = "@_" ;

   # find and extract the first parameter
   if ( $s =~ /TraceEntry\d* *\( *(.*?),/m )
   {
      my $tp = $1 ;

      # change any calls like: ASSERT_NONNULL(..., 0 ) to ASSERT_NONNULL(..., $tp )
      $s =~ s/(^ *ASSERT_NONNULL.*?), *0 *\) *; *$/$1, $tp ) ;/mg ;

      # change any calls like: ASSERT_THIS( 0 ) to ASSERT_THIS( $tp )
      $s =~ s/^( *ASSERT_THIS) *\( *0 *\) *; *$/$1( $tp ) ;/mg ;
   }

   return "$s" ;
}

Posted in perl and general scripting hackery

A combined application for grep -n ; vim -q ; and perl evaluated regex

Posted by peeterjoot on September 11, 2009

Now that I’ve learned how to use evaluated replacement expressions in perl it’s become my new favorite tool. Here’s today’s application, using it as a query engine to figure out all the calls of a particular function that I want to look at in the editor and probably modify.

I’m interested in editing a subset of the function calls for the module in a given directory. I can find them and their line numbers with:

grep -n 'printIt.*BLAH' *.C

But there’s 90 of these function calls, and I know most don’t need alteration. If I grep with context, say grabbing 20 lines of context after the search expression, I can see which of these are of interest:

grep -nA20 'printIt.*BLAH' *.C | tee grep.out

I really want to weed out all the calls that also contain certain additional expressions. Illustrating by example, a fragment of the grep output above had in it:

foo.C:6197:   printIt( BLAH,
foo.C-6198-          ...
foo.C-6200-          INFORMATIONAL,
foo.C-6205-          ...
foo.C-6210-          ) ;

Any of these calls that happen to have INFORMATIONAL or DUMPIT strings in them aren’t of interest, so I take my pre-canned evaluated regex perl script (see previous posts for an explanation) and modify it slightly.

This time I use:

# cat ./thisFilterScript
#!/usr/bin/perl

while (<>)
{
   $p .= $_ ;
}

$p =~ s/(printIt.*?;)/foo("$1")/smeg ;
print $p ;

exit ;

sub foo
{
   my $s = "@_" ;

   return "" if ( $s =~ /INFORMATIONAL/ or $s =~ /DUMPIT/ ) ;

   return "$s" ;
}

Run this on the grep output, and I’ve now reduced it to just a listing of the calls of interest:

# cat grep.out | ./thisFilterScript > grep.filtered

This is now just the filename:linenumber:output expressions for each of the function calls of interest.

# cat grep.filtered
foo.C:6303:         printIt( BLAH,
foo.C:6344:         printIt( BLAH,
foo.C:10298:   printIt( BLAH,
foo.C:10325:   printIt( BLAH,

I can now simply run ‘vim -q ./grep.filtered’, and I go straight to the line for the first hit (with :cn to get to the next when done with editing that call site).

Posted in perl and general scripting hackery

regular expression driven code alteration

Posted by peeterjoot on September 2, 2009

Exercise. Have code with repeated blocks of (trace-stuff, return), like the following:

if ( foo )
{
   TraceData( TraceId, 10, NULL, 0 );
   TraceExit( TraceId, FALSE );
   return FALSE;   
}

...
if ( bar )
{
   TraceData( TraceId, 20, NULL, 0 );
   TraceExit( TraceId, FALSE );
   return FALSE;   
}

The function in question actually has many of these, and a goto would work well to consolidate them and make it harder to miss the TraceExit calls. Yes, some people will argue that gotos are evil, but I’m working with a codebase that uses them regularly to enforce a single return point, and there’s no point fighting with 20 years of historical inertia. We go with the flow, and add the following to the end of the function:

TRACE_AND_EXIT:

    TraceData( TraceId, probe, NULL, 0 );
    TraceExit( TraceId, matched );

    return matched ;
}

Now, the job becomes taking this repeated three line sequence and replacing it with something that doesn’t generate two trace function calls at every return point, and avoids the multiple return points. That is:

   probe = 10 ; matched = FALSE ;
   goto TRACE_AND_EXIT ;

Being lazy, but too playful for my own good, I recycle a simple but powerful code alteration script detailed in a previous blog post and, with alteration of only two regular expressions, produce:

while (<>)
{
   $p .= $_ ;
}

# probe counter used by foo(); it must be initialized before the substitution runs.
my $probe = 0 ;

$p =~ s/^(\s+)(TraceData.*?return\s+.*?\S+ *;)/foo($1, "$2")/smeg ;
print $p ;

exit ;

sub foo
{
   my ($leadingSpaces, $rest) = @_ ;

   # not bothering with the old probe points.  Just renumber them 10, 20, 30 ...
   $probe += 10 ;

   $rest =~ /return\s+(.*?) *;/sm or die ;

   return "${leadingSpaces}probe = $probe ; matched = $1 ;\n${leadingSpaces}goto TRACE_AND_EXIT ;" ;
}

I really only want to apply this to the body of the current function. I can do that by positioning myself near the beginning of the function in vi, and using this script as a filter to modify from the current line all the way to the goto LABEL that I just added:

,/TRACE_AND_EXIT/ !perl ./thisHackyScript

There are a few new things in this one liner. Like many other vi (ex) commands, it starts with a range of line numbers:

:N,M

Leaving off the first number means start from the current line. I’ve used a pattern instead of a line number for the end of the range, so the command runs until the expression TRACE_AND_EXIT is encountered. The last bit, the !, filters the selected lines through a command, in this case the hacky little perl script above, which I’ve put in the local directory as ./thisHackyScript.

Now, I’ve also defaulted matched = FALSE in its declaration, so make a final cleanup pass in vi, again positioning myself near the beginning of the function:

,/^}/ s/ matched = FALSE ;//c

As above, the beginning and end of the range for this alteration are the current line and an expression, and the command to apply is the replacement of ‘ matched = FALSE ;’ text with nothing. The c modifier at the end here is very handy and says “prompt for each change”. If you like what it is doing and haven’t made an error with your replacement expression (hard to do here), then answering the prompt with ‘a’ quits the prompt and just does the rest.

Finito. Now did this explanation make any sense… perhaps not. As a fallback you can grab the little filter script and play with it to figure it out and then have a nice little tool for some semi-automatic mucking around in your own code.

Posted in perl and general scripting hackery

perl regex tricks. Non capturing expressions and alternatives.

Posted by peeterjoot on August 27, 2009

Have two sets of dump output to compare, and both have the occasional pointer dumped which messes up the diff. I want to mask the pointer output (all starting with 0x) on all the lines like:

 List Entry Address:           0x...
 List Tail (primary):             0x...
 List Tail (secondary):           0x...
 Next entry name collision:    0x...
 Next entry PLEID collision:   0x...
 Next entry (primary):         0x...
 Next entry (secondary):       0x...
 Previous entry (primary):     0x...
 Previous entry (secondary):   0x...

An easy way would be to run ‘grep -v’ and just filter these out completely, but I wanted the original line numbers to stay intact for reference.

Here’s a one liner perl script, executed with ‘perl -pi ./myScript *.fmt’ (where the files *.fmt are what I’m mucking with) :

$ cat myScript
s/((?i:Next|List|previous) (?i:entry|head|tail).*0x).*/$1................/;

Since I had to look up (man perlre) how to do this once again, it’s a good blog topic for self reference. Let’s break it down. The first thing is an outermost capturing pattern:

s/(stuff.*0x).*/$1................/;

This says match ‘stuff.*0x’, namely ‘stuff’ followed by anything (the .* part), then ‘0x’, then anything. Everything within the parentheses goes into $1, so the replacement is the matched text up to and including 0x, followed by 16 dots in place of whatever came after it. Now look at the nested expression before the .*0x part:

(?i:Next|List|previous) (?i:entry|head|tail)

More perl ASCII barf starting things off, but it’s not so bad. An expression like (?:stuff) means match ‘stuff’ but don’t capture it (i.e. don’t put it in $2 or $3, …). Only slightly more complex is having alternatives in the pattern, so something like (?:Next|List) means match Next or match List, again without putting anything into a $N variable. The one bit left unexplained is the ‘i’ modifier flag, which makes that group case insensitive. In this case I could have made it a global flag at the end of the replace specification so that it would apply to the whole pattern:

s/((?:Next|List|previous) (?:entry|head|tail).*0x).*/$1................/i;

but initially I had the case Insensitive modifier only on one of the patterns, so the final result ended up with some redundancy.

Posted in perl and general scripting hackery

another good regex trick. Matching a word boundary.

Posted by peeterjoot on August 26, 2009

Task: Have a badly named variable, in this case caKeyValue, and want to change this in a number of files to caKeySample.

I’ve also got variables named m_caKeyValue that I don’t want to change. Once I’m done with this replacement, all the variables left with caKeyValue in their names will be the ones I’m interested in, and I can examine all of those in sequence to make sure that I’m treating those right.

A regular expression with a word boundary pattern is the trick. Here’s a sample command, starting with a small file that has my search and replace patterns:

$ cat myPatterns
# if there are trailing spaces then try not to mess up indenting:
s/\bcaKeyValue\b  /caKeySample /g;

# but mess up indenting if there's no option:
s/\bcaKeyValue\b/caKeySample/g;

and here’s the perl command line invocation to do the replacement:

$ perl -pi ./myPatterns `cat listOfFiles`

Let’s break it down. First the command. We use the -p and -i perl command line flags as explained in previous posts: -p wraps the perl script in a loop over the input lines, printing each one after the script has run, and -i modifies the files in place (in this case without backup since I’ve just checked them out of the version control system).

If the perl script containing the search and replace patterns I want were to contain just:

s/caKeyValue/caKeySample/g;

Then all instances of caKeyValue would be replaced. I only want this if they aren’t embedded in something else (like m_caKeyValue). If I only cared about not mucking with variables named m_caKeyValue then a sufficient replacement expression would be:

s/\bcaKeyValue/caKeySample/g;

This says it’s okay to do the replacement if something trails “caKeyValue” without spaces, so caKeyValues, say, would be replaced by caKeySamples. To be careful I’m telling perl to be stricter, requiring a word boundary (a transition between a word character and a non-word character such as whitespace or punctuation, or the edge of the line) at both the beginning and the end of the expression. That is:

s/\bcaKeyValue\b/caKeySample/g;

Now, a diff of the results with the originals in the version control system (something to _always_ do with automated code changes) showed that I was messing up the indentation in some cases, as in the following diff fragment:

-   SAL_ENCODED_CA_STATE                caKeyValue        = m_caKey.SAL_SampleCaKeyValue() ;
+   SAL_ENCODED_CA_STATE                caKeySample        = m_caKey.SAL_SampleCaKeyValue() ;

So, to compensate, I undid my automated change and added a first replacement pattern to keep things pretty:

s/\bcaKeyValue\b  /caKeySample /g;

This one matches the expression with two trailing spaces and replaces it with the desired name plus one trailing space. Note that the trailing \b is redundant in this case, but since I was cutting and pasting the regex from the initial try, I had this extra bit and it doesn’t change the result.

Posted in perl and general scripting hackery

dirty perl tricks. using evaluations in a replacement expression

Posted by peeterjoot on August 6, 2009

I’ve gone and done a search and replace of a return type everywhere in a certain file, and the new typedef name has more characters than the original. Now the nicely indented prologues for each function are all messed up like so:

inline SAL_CA_STATUS_TYPE SQLE_CA_CONN_ENTRY_DATA::sqleCaCeWriteSA( CASA_t * const          SAToken,
                                                            SAL_CA_PAGENAME_TYPE * const    pSaPageName,
                                                            const Uint8             newElement,
                                                            const Uint8             cond,
                                                            const Uint8             maxagg,
                                                            const Uint8             increasing,
                                                            const Uint64            input,
                                                            Uint64   * const        aggregate,
                                                            bool * const            pbDispatchError )

I want things indented by eight characters on all the lines after the ones that start with ‘inline SAL_CA_STATUS_TYPE’, up to the closing parenthesis that marks the end of the argument list. It should look like:

...
inline SAL_CA_STATUS_TYPE SQLE_CA_CONN_ENTRY_DATA::sqleCaCeWriteSA( CASA_t * const          SAToken,
                                                                    SAL_CA_PAGENAME_TYPE * const    pSaPageName,
                                                                    const Uint8             newElement,
                                                                    const Uint8             cond,
                                                                    const Uint8             maxagg,
                                                                    const Uint8             increasing,
                                                                    const Uint64            input,
                                                                    Uint64   * const        aggregate,
                                                                    bool * const            pbDispatchError )
{
...

Kind of a silly exercise in prettying things up, but the new poor formatting makes things harder to read, and is distracting for maintenance. I’ve got 44 such functions in this file and don’t want to do them manually.

Is there an easy way to indent all the lines after the first by the eight characters needed in this case? I’ve wanted to do scripted changes like this before (like add an argument to all function calls matching some pattern), so it seemed like it was worth a few minutes to play with it. Here’s what I came up with:

#!/usr/bin/perl

while (<>)
{
   $p .= $_ ;
}

$p =~ s/^(inline SAL_CA_STATUS_TYPE.*?\))/foo("$1")/smeg ;
print $p ;

exit ;

sub foo
{
   my $s = "@_" ;
   $s =~ s/^ /         /smg ;

   return "$s" ;
}

Here’s a breakdown of what this does and how. The first problem is that I want to operate on the whole file and not on a line by line basis. This loop:

while (<>)
{
   $p .= $_ ;
}

sucks up each line from stdin and puts it all in a working variable $p.

Now that I’ve got all 19000 lines of the file in a working variable (yes, I should probably split up my file;), I want to match all instances of any lines that start with ‘inline SAL_CA_STATUS_TYPE’ up to the first closing parenthesis, which ends the argument list. I don’t have any function pointer arguments so I can match until the first ) after the starting expression. So, a match expression that does the job is:

/^inline SAL_CA_STATUS_TYPE.*?\)/

The caret says match the beginning of the line, and since ) is a special character in regular expressions I have to escape it. I also don’t want to match past the first ) so I use a non-greedy pattern ‘.*?’ … meaning match anything but stop at the earliest point based on context. Next I want to put all of this into a variable I can refer to in the replacement expression (that is $1 ), so I wrap the whole thing in the capture pattern (). That leaves me with:

/^(inline SAL_CA_STATUS_TYPE.*?\))/

Since I want to do this for all matches, I need the g modifier at the end, and since the text is multiline, I need /sm too (/s lets . match newlines, and /m lets ^ match at the start of every line). If I wanted a pure text change at this point, I could do something like:

$p =~ s/^(inline SAL_CA_STATUS_TYPE.*?\))/blah$1blah/smg ;

This would wrap all instances of the pattern with blah blah, like a quoting operation. What I want though, is to extract all matches to this first pattern and do more to it. The /e modifier does that, and allows the replacement expression to be code. I wrote a quick function that in turn did my second search and replace:

sub foo
{
   my $s = "@_" ;
   $s =~ s/^ /         /smg ;

   return "$s" ;
}

In this helper function, I replace the single leading space of every line that starts with one with nine spaces, for a net indent of eight, and voila I’m done. It takes longer to explain this throwaway script than to write it, and I now have a template for other similar automated changes in the future.

Posted in perl and general scripting hackery