Awk...short intro
Oct 13, 2006



Awk -- A Pattern Scanning and Processing Language

There are three variations of AWK:
AWK - the original from AT&T
NAWK - A newer, improved version from AT&T
GAWK - The Free Software Foundation's version

It was created in the late 1970s. The original book that describes AWK is: Alfred V. Aho, Brian W. Kernighan, and Peter J. Weinberger, The AWK Programming Language, Addison-Wesley, 1988.

A few points about Awk compared to Perl or other languages:

  - awk is simpler 
  - awk syntax is far more regular 
  - you may already know awk well enough for the task at hand
  - awk can be smaller and much quicker to execute for small programs
  - yet, it is not suited for everything, as you may guess




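NB: Converting a classic Mac OS text file (CR line endings) to Unix (LF line endings):
tr '\r' '\n' < macfile.txt > unixfile.txt
or
awk '{ gsub("\r", "\n"); print $0;}' macfile.txt > unixfile.txt
To convert a Unix file to classic Mac OS format using awk, enter at the command line:
awk 'BEGIN { ORS = "\r" } { print }' unixfile.txt > macfile.txt
Setting ORS is needed because awk strips the newline from each record before the action runs, so a gsub on "\n" would find nothing to replace.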


Delete BOTH leading and trailing whitespace from each line
sed 's/^[ \t]*//;s/[ \t]*$//' file
(\t is understood by GNU sed; with other versions of sed, type a literal tab character instead)


Simple sed for html
sed -e 's/target="_blank">http.*</target="_blank">LINK<\/a></g' my.html > my2.html
This works with sed on the Mac; sed behaves differently across operating systems, so test it on yours.


------AWK------



An awk program is a sequence of statements of the form:
        pattern   { action }
        pattern   { action }

Each line of input is matched against each of the patterns. For each pattern that matches, the associated action is executed. When all the patterns have been tested, the next line is fetched and the matching starts over.
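The matching cycle described above can be sketched with a few made-up input lines (the fruit names and numbers are invented for illustration). The second line matches both patterns, so both actions run for it before the next line is read:

```shell
printf 'apple 3\nbanana 12\ncherry 7\n' |
awk '
  $2 > 5 { print "big:", $1 }      # numeric test on field 2
  /an/   { print "has an:", $1 }   # regex test on the whole line
'
```

This should print "big: banana", "has an: banana" and "big: cherry", in that order.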



Examples
awk 'NR == 1 {print  $2}' mydatafile
This says select line one and do action {..}, here print field 2

awk 'length > 10' myfile
This prints every line that is longer than 10 characters

awk '{ $1 = log($1); print }' myfile
Replaces the first field of each line by its logarithm

awk '$2 ~ /A|B|C/' myfile
Prints all input lines with an A, B, or C in  the  second  field

awk '$2 ~ /0\.1/' myfile > myoutput
Prints all lines with 0.1 in the second field and writes them to the file myoutput (the dot is escaped so that it matches a literal "." rather than any character)

awk '{print "end"; print $0}' myfile
This prints the word "end" before each line

awk '{print "$" $0}' myfile
Prints $ in front of each line

sed 's/\$/"/g'
Substitutes each $ by "

sed 's/> <ID>/> <ID_STRUCTURE>/g' file_input.sdf > fileoutput.sdf
Substitutes each "> <ID>" by "> <ID_STRUCTURE>", for LigandInfo for example (the angle brackets are left unescaped because GNU sed treats \< and \> as word boundaries)

This prints the last field of each line
awk '{ print $NF }' myfile

Print the last field of the last line
awk '{ field = $NF }; END{ print field }' myfile

Print every line with more than 4 fields
awk 'NF > 4' myfile

Print every line where the value of the last field is > 4
awk '$NF > 4' myfile

Some other examples
awk 'BEGIN {i=1; while (i<=10){ print i*i; i++}}'

awk 'BEGIN {col = 13; {print col}}'

awk 'BEGIN {lines=0} {lines++} END {print lines}' myfile
(similar to wc -l)


Change the field separator:
If myfile contains:
uuuu:kkkk:lllll5676
uuuu:kkkk:lllll8999
uuuu:kkkk:lllll00999

awk 'BEGIN {FS=":"} {print $2}' myfile
This sets the field separator to : and prints field 2


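awk '$3 == 0 {print $1}' myfile1 myfile2
For every line of either file whose field 3 equals 0, print field 1

awk '$2 > 0.5 {col = col + 1} END {print col + 0}' myfile
Prints the number of lines whose field 2 is greater than 0.5 (col + 0 makes the result 0 rather than an empty line when nothing matches)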

Print lines with the word "brian" in them:
awk '/brian/   { print  $0 }' myfile

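Print a heading with the name of the file, then each input line preceded by its line number
awk 'FNR == 1 { print "File:", FILENAME } { print NR, ":\t", $0 }' myfile
(FILENAME is not yet set inside BEGIN, so the heading is printed when the first line of the file is read)

awk '{name = name $1} END {print name}' myfile
Concatenates the first field of every line and prints the result on one line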


Insert 5 blank spaces at beginning of each line
awk '{sub(/^/, "     ");print}'

Substitute "foo" with "bar" EXCEPT for lines which contain "baz"
 awk '!/baz/{gsub(/foo/, "bar")};{print}'

Change "scarlet" or "ruby" or "puce" to "red"
 awk '{gsub(/scarlet|ruby|puce/, "red"); print}'

Remove duplicate, consecutive lines (emulates "uniq")
 awk 'NR == 1 || a != $0; {a = $0}'   # string comparison, so regex metacharacters in the line cause no trouble

Remove duplicate, nonconsecutive lines
 awk '! a[$0]++'                     # most concise script
 awk '!($0 in a) {a[$0];print}'      # most efficient script

Print the first line
awk 'NR <2' test

Print the last line of a file (emulates "tail -1")
 awk 'END{print}'

Print only lines which match regular expression (emulates "grep")
 awk '/regex/'

Print only lines which do NOT match regex (emulates "grep -v")
 awk '!/regex/'

Print section of file from regular expression to end of file
 awk '/regex/,0'
 awk '/regex/,EOF'

Print section of file based on line numbers (lines 8-12, inclusive)
 awk 'NR==8,NR==12'

Print line number 52
 awk 'NR==52'
 awk 'NR==52 {print;exit}'          # more efficient on large files

Print section of file between two regular expressions (inclusive)
 awk '/Iowa/,/Montana/'             # case sensitive


Delete ALL blank lines from a file (same as "grep '.' ")
 awk NF myfile > myoutput      # also drops lines that contain only blanks or tabs
 awk '/./'                     # drops only truly empty lines

awk '{n=5 ; print $n}' myinput
prints the fifth field in the input record

To list a file, but skip over all the blank lines at the start of the file, use the command:
awk '/[^ ]/ { copy=1 }; copy { print }' filename.ext

To list all the lines of a file (TMP.LOG), except those containing the string "Frame overlap", you can use the command:
awk '!/Frame overlap/' TMP.LOG

Adding a blank line after lines in a list
To add a blank line after all lines containing "util.h":
awk '{ print $0; if ($0 ~ /util\.h/) print "" }' TMP.TMP

pattern: none, so the action runs for every line
action: print $0; if ($0 ~ /util\.h/) print ""   (print the line, then, if the line contains "util.h", print a blank line)

On a Unix shell the program is enclosed in single quotes: inside double quotes the shell would expand $0 itself, and an interactive bash may treat ! as a history reference.

to add a blank line at the end of a file:
awk '{print $0} END {print ""}' myfile.txt > myoutput.txt


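If a line containing $$$$ is followed by a line containing ISIS, join the two into a single line reading "$$$$ ISIS":
sed '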
/$$$$/ {
N
/\n.*ISIS/ {
s/$$$$.*\n.*ISIS/$$$$ ISIS/
}
}'


awk 'BEGIN {for (x=1; x<=50; ++x) {printf("%3d\n",x) >> "tfile"}}'
dumps the numbers from 1 to 50 into "tfile".

 Output can also be "piped" into another utility with the "|" ("pipe") operator. One can pipe output to the "tr" ("translate") utility to convert it to upper-case:
awk 'BEGIN { print "this is a test"}' | tr "[a-z]" "[A-Z]" 
yields:
   THIS IS A TEST






HOW TO RUN AWK

-You can type: awk '{.....}' myinputfile > myoutputfile
or use >> myoutputfile to append to an existing file

-You can put the script in a file and run: awk -f myscript myfile
There are many other ways as well. Quoting rules depend on the shell you are using (sh, csh, tcsh, bash...), so watch out for differences in behavior.

Command-line options:
 -F fs
 Sets the FS variable to fs.
 -f source-file
 Indicates that the awk program is to be found in source-file instead of in the first non-option argument.
 -v var=val
 Sets the variable var to the value val before execution of the program begins. Such variable values are available inside the BEGIN rule. The '-v' option can only set one variable, but you can use it more than once, setting another variable each time, like this: '-v foo=1 -v bar=2'.
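As a small sketch of the options above, the following combines -F and -v. The variable name "min" and the sample data are made up for this example:

```shell
# -F sets the field separator; -v passes a value in from the shell.
# "min" is a hypothetical variable name chosen for this example.
printf 'a:5\nb:20\nc:11\n' |
awk -F: -v min=10 '$2 >= min { print $1 }'
```

Only the lines whose second field is at least 10 are selected, so this should print b and c.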





ESSENTIAL SYNTAX


Arithmetic
Operator   Meaning
+          Addition
-          Subtraction
*          Multiplication
/          Division
%          Modulo (remainder)
^          Exponentiation
++         Increment
--         Decrement
+=         Plus equals
-=         Minus equals
*=         Multiply equals
/=         Divide equals
%=         Modulo equals
^=         Exponentiation equals



awk 'BEGIN {OFS="\t"} {print $1,$2,$3,(($1+$2+$3)/3)}' IN > OUT
Prints column 1, column 2, column 3, and the mean of the 3 columns

awk '{print ($1/66),($2/6430),($3/627)}' IN > OUT
Divides each column by a different number
awk '{printf "%.3f\t%.3f\t%.3f\n", ($1/66),($2/6430),($3/627)}' IN > OUT
Uses printf to print 3 digits after the decimal point, separated by tabs


Conditional expressions
Operator   Meaning              
==             Is equal              
!=          Is not equal to          
>          Is greater than          
>=      Is greater than or equal to 
<          Is less than          
<=      Is less than or equal to      


Regular Expression Operators 
Operator     Meaning         
~                Matches         
!~         Doesn't match  

ex: word !~ /START/


Logical operators
&& (AND), || (OR) and ! (NOT)

Built-in VARIABLES


Records and Fields
Awk input is divided into records terminated by a record separator. The default record separator is a newline, so by default awk processes its input a line at a time. The number of the current record is available in a variable named NR.
Each input record is considered to be divided into fields. Fields are normally separated by white space -- blanks or tabs -- but the input field separator may be changed. Fields are referred to as $1, $2, and so forth, where $1 is the first field, and $0 is the whole input record itself. Fields may be assigned to. The number of fields in the current record is available in a variable named NF.
The variables FS and RS refer to the input field and record separators; they may be changed at any time to any single character. The optional command-line argument -Fc may also be used to set FS to the character c.
The variable FILENAME contains the name of the current input file.
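A minimal sketch of these variables at work, on two invented input lines. The first command prints the record number, the field count, and the first and last fields; the second assigns to a field, which makes awk rebuild $0:

```shell
# NR = record number, NF = number of fields, $NF = last field
printf 'one two three\nfour five\n' |
awk '{ print NR, NF, $1, $NF }'

# Assigning to a field rebuilds the whole record
printf 'a b c\n' |
awk '{ $2 = "X"; print }'
```

The first command should print "1 3 one three" and "2 2 four five"; the second should print "a X c".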


print $0 prints the full line. If the line has 8 fields separated by single blanks, it is equivalent to:
print $1, $2, $3, $4, $5, $6, $7, $8




FS - The Input Field Separator
The input field separator, a blank by default

FNR
The input record number in the current input file

OFS - The Output Field Separator
The output field separator, a blank by default

ORS - The Output line record Separator
The output record separator, by default a newline
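As an illustration of the two output separators (the input letters are made up), the commas in print are replaced by OFS and every print is terminated by ORS:

```shell
printf 'a b c\nd e f\n' |
awk 'BEGIN { OFS = "-"; ORS = ";\n" } { print $1, $3 }'
```

This should print "a-c;" and "d-f;" on separate lines.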


NF - The Number of Fields
Awk counts the number of fields in the input line and puts it into a variable called NF.
awk '{print NF, $NF}' myinput
Prints the number of fields and the last field of each line

NR - The Number of Records - the current input line number
Awk counts the number of lines it reads
awk '{print NR, $0}'
This prints the line number and the complete line
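The difference between NR and FNR only shows up with more than one input file. A small sketch, using two throwaway files created in a temporary directory (the file names f1 and f2 are arbitrary):

```shell
dir=$(mktemp -d)                 # scratch directory for the two sample files
printf 'a\nb\n' > "$dir/f1"
printf 'c\n'    > "$dir/f2"
# NR keeps counting across files; FNR restarts at 1 for each file
awk '{ print NR, FNR, $0 }' "$dir/f1" "$dir/f2"
rm -r "$dir"
```

The last line printed should be "3 1 c": the third record overall, but the first of the second file.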

RS - The input line Record Separator (default = newline)
The input record separator, by default a newline

Change the RS


awk 'BEGIN {
# change the record separator from newline to the empty string,
# so that records are separated by blank lines
    RS=""
# change the field separator from whitespace to newline
    FS="\n"
}
{
# print the second and third line of each record
    print $2, $3}' myfile

if myfile is:
50.211 14.979 24.196
50.142 15.162 25.415

awk 'BEGIN {RS=""; FS="\n"} {print $2}' myfile
gives:
50.142 15.162 25.415




Arrays
awk provides single dimensioned arrays. Arrays need not be declared, they are created in the same manner as awk user defined variables.
Elements can be specified as numeric or string values.
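A common use of arrays with string subscripts is counting. The sketch below tallies how often each value of the first field occurs; the colour names are invented input, and the output is piped through sort because the order of a "for (c in count)" loop is not specified:

```shell
printf 'red\nblue\nred\nred\n' |
awk '{ count[$1]++ }                              # the element is created on first use
     END { for (c in count) print c, count[c] }' |
sort
```

This should print "blue 1" and "red 3".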


Length
this counts the number of characters in a string
awk '{print length($0)}' myfile


Print and Printf
The print statement does output with simple, standardized formatting. You specify only the strings or numbers to be printed, in a list separated by commas. They are output, separated by single spaces, followed by a newline. The statement looks like this:
print item1 , item2 , ...

The simple statement print with no items is equivalent to print $0
it prints the entire current record. To print a blank line, use 'print ""', where "" is the null, or empty, string.


Using printf Statements for Fancier Printing
A format specifier starts with the character % and ends with a format-control letter; it tells the printf statement how to output one item. The format-control letter specifies what kind of value to print. The rest of the format specifier is made up of optional modifiers which are parameters such as the field width to use.

 Here is a list of the format-control letters:
 c    This prints a number as an ASCII character. Thus, 'printf "%c", 65' outputs the letter A. The output for a string value is the first character of the string.
 d    This prints a decimal integer.
 i    This also prints a decimal integer.
 e    This prints a number in scientific (exponential) notation. For example,
      printf "%4.3e", 1950
      prints 1.950e+03, with a total of four significant figures of which three follow the decimal point. The 4.3 are modifiers, discussed below.
 f    This prints a number in floating point notation.
 g    This prints a number in either scientific notation or floating point notation, whichever uses fewer characters.
 o    This prints an unsigned octal integer.
 s    This prints a string.
 x    This prints an unsigned hexadecimal integer.
 %    This isn't really a format-control letter, but it does have a meaning when used after a %: the sequence '%%' outputs one '%'. It does not consume an argument.

A format specification can also include modifiers that can control how much of the item's value is printed and how much space it gets. The modifiers come between the '%' and the format-control letter. Here are the possible modifiers, in the order in which they may appear:
 '-'
The minus sign, used before the width modifier, says to left-justify the argument within its specified width. Normally the argument is printed right-justified in the specified width. Thus,
printf "%-4s", "foo"

 prints 'foo '.

'width'
This is a number representing the desired width of a field. Inserting any number between the '%' sign and the format control character forces the field to be expanded to this width. The default way to do this is to pad with spaces on the left. For example,
printf "%4s", "foo"

 prints ' foo'.  The value of width is a minimum width, not a maximum. If the item value requires more than width characters, it can be as wide as necessary. Thus,
printf "%4s", "foobar"

 prints 'foobar'.  Preceding the width with a minus sign causes the output to be padded with spaces on the right, instead of on the left.
 '.prec'
This is a number that specifies the precision to use when printing. This specifies the number of digits you want printed to the right of the decimal point. For a string, it specifies the maximum number of characters from the string that should be printed.
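The three modifiers can be seen together in one small sketch (the values are arbitrary). The brackets are only there to make the padding visible:

```shell
# %-6s  : left-justified in 6 columns
# %6s   : right-justified in 6 columns
# %6.2f : 6 columns wide, 2 digits after the decimal point
# %.2s  : at most 2 characters of the string
awk 'BEGIN { printf "[%-6s][%6s][%6.2f][%.2s]\n", "ab", "ab", 3.14159, "abcdef" }'
```

This should print [ab    ][    ab][  3.14][ab].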

The C library printf's dynamic width and prec capability (for example, "%*.*s") is supported. Instead of supplying explicit width and/or prec values in the format string, you pass them in the argument list. For example:
w = 5
p = 3
s = "abcdefg"
printf "<%*.*s>\n", w, p, s


is exactly equivalent to
s = "abcdefg"
printf "<%5.3s>\n", s

Both programs output '<**abc>', where each "*" stands for one space, to show clearly that there are two spaces in the output.


PRINT AND PRINTF again
The simplest output statement is the by-now familiar "print" statement. There's not too much to it:

    •      "Print" by itself prints the input line.

    •      "Print" with one argument prints the argument.

    •      "Print" with multiple arguments prints all the arguments, separated by  spaces (or other specified OFS) when the arguments are separated by commas, or concatenated when the arguments are separated by spaces.
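The difference between commas and plain spaces in a print statement is easiest to see with a visible OFS. A minimal sketch:

```shell
# With commas the arguments are joined by OFS; without them they are concatenated.
awk 'BEGIN { OFS = "-"; print "a", "b"; print "a" "b" }'
```

The first print should give a-b and the second ab.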

 * The "printf()" (formatted print) function is much more flexible, and trickier. It has the syntax:
   printf(<string>,<expression list>)

 The "string" can be a normal string of characters:
   printf("Hi, there!")

 This prints "Hi, there!" to the display, just like "print" would, with one slight difference: the cursor remains at the end of the text, instead of skipping to the next line, as it would with "print". A "newline" code ("\n") has to be added to force "printf()" to skip to the next line:
   printf("Hi, there!\n")

 So far, "printf()" looks like a step backward from "print", and if you use it to do dumb things like this, it is. However, "printf()" is useful when you want precise control over the appearance of the output.

 The trick is that the string can contain format or "conversion" codes to control the results of the expressions in the expression list. For example, the following program:
   BEGIN {x = 35; printf("x = %d decimal, %x hex, %o octal.\n",x,x,x)}

 -- prints:
   x = 35 decimal, 23 hex, 43 octal.

 The format codes in this example include: "%d" (specifying decimal output), "%x" (specifying hexadecimal output), and "%o" (specifying octal output). The "printf()" function substitutes the three variables in the expression list for these format codes on output.

 * The format codes are highly flexible and their use can be a bit confusing. The "d" format code prints a number in decimal format. The output is an integer, even if the number is a real, like 3.14159. Trying to print a string with this format code results in a "0" output. For example:
   x = 35;     printf("x = %d\n",x)       yields:  x = 35
   x = 3.1415; printf("x = %d\n",x)       yields:  x = 3
   x = "TEST"; printf("x = %d\n",x)       yields:  x = 0

 * The "o" format code prints a number in octal format. Other than that, this format code behaves exactly as does the "%d" format specifier. For example:
   awk 'BEGIN {x = 255; printf("x = %o\n",x)}'          yields:  x = 377

 * The "x" format code prints a number in hexadecimal format. Other than that, this format code behaves exactly as does the "%d" format specifier. For example:
   x = 197; printf("x = %x\n",x)          yields:  x = c5

 * The "c" format code prints a character, given its numeric code. For example, the following statement outputs all the printable characters:
   BEGIN {for (ch=32; ch<128; ch++) printf("%c   %c\n",ch,ch+128)}

 * The "s" format code prints a string. For example:
   x = "jive"; printf("string = %s\n",x)  yields:  string = jive

 * The "e" format code prints a number in exponential format, in the default format:
   [-]D.DDDDDDe[+/-]DD

 For example:
   x = 3.1415; printf("x = %e\n",x)       yields:  x = 3.141500e+00   (some systems print a three-digit exponent, e+000)

 * The "f" format code prints a number in floating-point format, in the default format:
   [-]D.DDDDDD

 For example:
   x = 3.1415; printf("x = %f\n",x)       yields:  x = 3.141500

 * The "g" format code prints a number in exponential or floating-point format, whichever is shortest.

 * A numeric string may be inserted between the "%" and the format code to specify greater control over the output format. For example:
   %3d
   %5.2f
   %08s
   %-8.4s

 This works as follows:

    •      The integer part of the number specifies the minimum "width", or number of  spaces, the output will use, though the output may exceed that width if it  is too long to fit.

    •      The fractional part of the number specifies either, for a string, the  maximum number of characters to be printed; or, for floating-point  formats, the number of digits to be printed to the right of the decimal  point.

    •      A leading "-" specifies left-justified output. The default is right-justified output.

    •      A leading "0" specifies that the output be padded with leading zeroes to  fill up the output field. The default is spaces.

 For example, consider the output of a string:
   x = "Baryshnikov"
   printf("[%3s]\n",x)          yields:       [Baryshnikov]
   printf("[%16s]\n",x)         yields:       [     Baryshnikov]
   printf("[%-16s]\n",x)        yields:       [Baryshnikov     ]
   printf("[%.3s]\n",x)         yields:       [Bar]
   printf("[%16.3s]\n",x)       yields:       [             Bar]
   printf("[%-16.3s]\n",x)      yields:       [Bar             ]
   printf("[%016s]\n",x)        yields:       [00000Baryshnikov]   (on some awks; others pad strings with spaces)
   printf("[%-016s]\n",x)       yields:       [Baryshnikov     ]

 -- or an integer:
   x = 312
   printf("[%2d]\n",x)          yields:       [312]
   printf("[%8d]\n",x)          yields:       [     312]
   printf("[%-8d]\n",x)         yields:       [312     ]
   printf("[%.1d]\n",x)         yields:       [312]
   printf("[%08d]\n",x)         yields:       [00000312]
   printf("[%-08d]\n",x)        yields:       [312     ]

 -- or a floating-point number:
   x = 251.673209
   printf("[%2f]\n",x)          yields:       [251.673209]
   printf("[%16f]\n",x)         yields:       [      251.673209]
   printf("[%-16f]\n",x)        yields:       [251.673209      ]
   printf("[%.3f]\n",x)         yields:       [251.673]
   printf("[%16.3f]\n",x)       yields:       [         251.673]
   printf("[%016.3f]\n",x)      yields:       [000000000251.673]






----------

The keywords BEGIN and END are used to perform specific actions before and after reading the input lines. The BEGIN keyword is normally associated with printing titles and setting default values, whilst the END keyword is normally associated with printing totals
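A short sketch of that typical use, with three invented numbers as input: BEGIN prints a title before any input is read, the middle rule runs once per line, and END prints the total after the last line:

```shell
printf '3\n4\n5\n' |
awk 'BEGIN { print "values:" }
           { sum += $1; print "  " $1 }
     END   { print "total:", sum }'
```

The last line of output should be "total: 12".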

awk 'BEGIN {string = "Super" "power"; print string}'
this will print: Superpower

 For example, to extract and print the word "get" from "unforgettable":
   BEGIN {print substr("unforgettable",6,3)}

 Please be aware that the first character of the string is numbered "1", not "0". To extract a substring of at most ten characters, starting from position 6 of the first field variable, you use:
   substr($1,6,10)

----------


Escape sequences

Sequence   Description                  
\b          Backspace                      
\f          Formfeed                      
\n          Newline                      
\r          Carriage Return                  
\t          Horizontal tab
\"           Double quote
\a          The "alert" character; usually the ASCII BEL character
\v          Vertical tab
                  

example: awk '{ print $0 "\n"}' myfile
add a new empty line after each line


Regular Expressions
Pattern searching similar to grep and other unix utilities:
        /386/
        $1  ~  /386/
In regular expressions, the following symbols are metacharacters with special meanings.
        \  ^  $  .  [  ]  *  +  ?  (  )  |

        ^       matches the beginning of a string
        $       matches the end of a string
        .       matches any single character
        [ ]     defines a set of characters
        ( )     used for grouping
        |       specifies alternatives

Displays all lines whose first field does not contain 2, 3, 4, 6 or 8
awk '$1 !~ /[23468]/ { print $0 }'


How to hide special characters from the shell (this depends on the shell!)
Preceding any single character with a backslash ('\') quotes that character.

Thus, with tcsh:
awk "BEGIN { print \"Don't Panic!\" }"
gives the error:
tcsh: Unmatched '

but with bash it works.
With tcsh you need to write this instead:
awk 'BEGIN { print "Here is a single quote '\''" }'
The result is:
Here is a single quote '


Regular expressions are the extended kind found in egrep. They are composed of characters as follows:
 c
matches the character c (assuming c is a character with no special meaning in regexps).
 \c
matches the literal character c.
 .
matches any character except newline.
 ^
matches the beginning of a line or a string.
 $
matches the end of a line or a string.
 [abc...]
matches any of the characters abc... (character class).
 [^abc...]
matches any character except abc... and newline (negated character class).
 r1|r2
matches either r1 or r2 (alternation).
 r1r2
matches r1, and then r2 (concatenation).
 r+
matches one or more r's.
 r*
matches zero or more r's.
 r?
matches zero or one r's.
 (r)
matches r (grouping).




* The simplest kind of search pattern that can be specified is a simple string, enclosed in forward-slashes ("/"). For example:
   /The/

 -- searches for any line that contains the string "The". This will not match "the" as Awk is "case-sensitive", but it will match words like "There" or "Them".

 This is the crudest sort of search pattern. Awk defines special characters or "metacharacters" that can be used to make the search more specific. For example, preceding the string with a "^" tells Awk to search for the string at the beginning of the input line. For example:
   /^The/

 -- matches any line that begins with the string "The". Similarly, following the string with a "$" matches any line that ends with "The", for example:
   /The$/

 But what if you actually want to search the text for a character like "^" or "$"? Simple, just precede the character with a backslash ("\"). For example:
   /\$/

 -- matches any line with a "$" in it.

 * Such a pattern-matching string is known as a "regular expression". There are many different characters that can be used to specify regular expressions. For example, it is possible to specify a set of alternative characters using square brackets ("[]"):
   /[Tt]he/

 This example matches the strings "The" and "the". A range of characters can also be specified. For example:
   /[a-z]/

 -- matches any character from "a" to "z", and:
   /[a-zA-Z0-9]/

 -- matches any letter or number.

 A range of characters can also be excluded, by preceding the range with a "^". For example:
   /^[^a-zA-Z0-9]/

 -- matches any line whose first character is not a letter or digit.

 A "|" allows regular expressions to be logically ORed. For example:
   /(^Germany)|(^Netherlands)/

 -- matches lines that start with the word "Germany" or the word "Netherlands". Notice how parentheses are used to group the two expressions.

 * The "." special character allows "wildcard" matching, meaning it can be used to specify any arbitrary character. For example:
   /wh./

 -- matches "who", "why", and any other string that has the characters "wh" and any following character.

 The "." plays the role that "?" plays in UN*X shell file name patterns, but awk interprets "*" differently from the shell. In the UN*X shell, the "*" substitutes for a string of arbitrary characters of any length, including zero, while in awk the "*" matches zero or more repetitions of the previous character or expression. For example, "a*" would match "a", "aa", "aaa", and so on, as well as the empty string. That means that ".*" will match any string of characters.

 There are other characters that allow matches against repeated characters or expressions. A "?" matches zero or one occurrences of the previous regular expression, while a "+" matches one or more occurrences of the previous regular expression. For example:
   /^[-+]?[0-9]+$/

 -- matches any line that consists only of a (possibly signed) integer number. This is a somewhat confusing example and it is helpful to break it down by parts:
   /^                  Find string at beginning of line.
   /^[-+]?             Specify possible "-" or "+" sign for number.
   /^[-+]?[0-9]+       Specify one or more digits "0" through "9".
   /^[-+]?[0-9]+$/     Specify that the line ends with the number.
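To see the integer pattern in action, here is a quick sketch on a handful of made-up lines. Only the lines that consist entirely of an optionally signed integer should pass:

```shell
# The last input line is empty, to show that it is rejected too.
printf '42\n-7\n+13\n3.5\nabc\n12a\n\n' |
awk '/^[-+]?[0-9]+$/'
```

This should print 42, -7 and +13; the decimal number, the words and the empty line are rejected.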



The search can be constrained to a single field within the input line. For example:
   $1 ~ /^France$/

 -- searches for lines whose first field ("$1" -- more on "field variables" later) is the word "France", while:
   $1 !~ /^Norway$/

 -- searches for lines whose first field is not the word "Norway".

 It is possible to search for an entire series or "block" of consecutive lines in the text, using one search pattern to match the first line in the block and another search pattern to match the last line in the block. For example:
   /^Ireland/,/^Summary/

 -- matches a block of text whose first line begins with "Ireland" and whose last line begins with "Summary".
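A small sketch of such a block match, on invented input. Lines before the first pattern and after the second are skipped:

```shell
printf 'intro\nIreland 1\ndata\nSummary 1\ntail\n' |
awk '/^Ireland/,/^Summary/'
```

This should print the three lines from "Ireland 1" through "Summary 1", inclusive.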

   NF == 0

 -- matches all blank lines, or those whose number of fields is zero.
   $1 == "France"

 -- is a string comparison that matches any line whose first field is the string "France". The astute reader may notice that this example seems to do the same thing as the earlier example:
   $1 ~ /^France$/

 In fact, both examples do the same thing, but in the example immediately above the "^" and "$" metacharacters had to be used in the regular expression to specify a match with the entire first field; without them, it would match such strings as "FranceFour", "NewFrance", and so on. The string expression matches only to "France".

 * It is also possible to combine several search patterns with the "&&" (AND) and "||" (OR) operators. For example:
   ((NR >= 30) && ($1 == "France")) || ($1 == "Norway")

 -- matches any line from the 30th onward whose first field is "France", or any line whose first field is "Norway".

 * One class of pattern-matching that wasn't listed above is performing a numeric comparison on a field variable. It can be done, of course; for example:
   $1 == 100

 -- matches any line whose first field has a numeric value equal to 100. This is a simple thing to do and it will work fine. However, suppose you want to perform:
   $1 < 100

 This will generally work fine, but there's a nasty catch to it, which requires some explanation. The catch is that if the first field of the input can be either a number or a text string, this sort of numeric comparison can give crazy results, matching on some text strings that aren't equivalent to a numeric value.

 This is because awk is a "weakly-typed" language. Its variables can store a number or a string, with awk performing operations on each appropriately. In the case of the numeric comparison above, if $1 contains a numeric value, awk will perform a numeric comparison on it, as expected; but if $1 contains a text string, awk will perform a text comparison between the text string in $1 and the three-letter text string "100". This will work fine for a simple test of equality or inequality, since the numeric and string comparisons will give the same results, but it will give crazy results for a "less than" or "greater than" comparison.

 Awk is not broken; it is doing what it is supposed to do in this case. If this problem comes up, it is possible to add a second test to the comparison to determine if the field contains a numeric value or a text string. This second test has the form:
   (( $1 + 0 ) == $1 )

 If $1 contains a numeric value, the left-hand side of this expression will add 0 to it, and awk will perform a numeric comparison that will always be true.

 If $1 contains a text string that doesn't look like a number, for want of anything better to do awk will interpret its value as 0. This means the left-hand side of the expression will evaluate to zero; since there is a non-numeric text string in $1, awk will perform a string comparison that will always be false. This leads to a more workable comparison:
   ((( $1 + 0 ) == $1 ) && ( $1 > 100 ))
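The guarded comparison can be tried on a mix of numbers and words (the input is invented for this sketch). Without the guard, a plain '$1 > 100' would compare the words as strings and typically let them through:

```shell
printf '250\nabc\n99\nzzz\n' |
awk '(($1 + 0) == $1) && ($1 > 100)'   # numeric-looking AND greater than 100
```

Only 250 should be printed: "abc" and "zzz" fail the numeric test, and 99 fails the comparison.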




AWK Numerical Functions
Name         Function
cos(x)       Cosine of x, with x in radians
exp(x)       Exponential function (e to the power x)
int(x)       Integer part of x, truncated towards 0
log(x)       Natural logarithm of x
sin(x)       Sine of x, with x in radians
sqrt(x)      Square root of x
atan2(y,x)   Arctangent of y/x, in radians
rand()       Random number
srand(x)     Seed the random number generator
      

awk 'BEGIN { for (i = 1; i <= 7; i++) print int(101 * rand()) }'
This program prints 7 random numbers from 0 to 100, inclusive.

awk '{print sqrt($1)}' myfile
Print the square root for numbers in field 1


rand()
This gives you a random number. The values of rand are uniformly distributed between 0 and 1. The value can be 0 but is never 1.  Often you want random integers instead. Here is a user-defined function you can use to obtain a random nonnegative integer less than n:
function randint(n) {
     return int(n * rand())
}

 The multiplication produces a random real number greater than or equal to 0 and less than n. We then make it an integer (using int) between 0 and n - 1.  Here is an example where a similar function is used to produce random integers between 1 and n. Note that this program will print a new random number for each input record.
awk '
# Function to roll a simulated die.
function roll(n) { return 1 + int(rand() * n) }

# Roll 3 six-sided dice and print total number of points.
{
      printf("%d points\n", roll(6)+roll(6)+roll(6))
}'

 Note: rand starts generating numbers from the same point, or seed, each time you run awk. This means that a program will produce the same results each time you run it. The numbers are random within one awk run, but predictable from run to run. This is convenient for debugging, but if you want a program to do different things each time it is used, you must change the seed to a value that will be different in each run. To do this, use srand.
 srand(x)
The function srand sets the starting point, or seed, for generating random numbers to the value x. Each seed value leads to a particular sequence of "random" numbers. Thus, if you set the seed to the same value a second time, you will get the same sequence of "random" numbers again. If you omit the argument x, as in srand(), then the current date and time of day are used for a seed. This is the usual way to get numbers that differ from run to run, although they are still pseudo-random, and two runs started at the same clock time may receive the same seed. The return value of srand is the previous seed. This makes it easy to keep track of the seeds for use in consistently reproducing sequences of random numbers.
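
A minimal sketch of the effect of a fixed seed. The function name roll_three and the seed value 42 are arbitrary choices for illustration; the actual numbers printed depend on the awk implementation, so none are shown here.

```shell
# Hypothetical helper: seed the generator with the first argument,
# then roll a six-sided die three times.
roll_three() {
    awk -v seed="$1" 'BEGIN {
        srand(seed)
        for (i = 1; i <= 3; i++) printf "%d ", 1 + int(rand() * 6)
        printf "\n"
    }'
}

# Same seed, same sequence: these two lines should be identical.
roll_three 42
roll_three 42
```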



String Functions

     index(string,search)           
     length(string)               
     split(string,array,separator)   
     substr(string,position)          
     substr(string,position,max)      
     sub(regex,replacement)           
     sub(regex,replacement,string)    
     gsub(regex,replacement)           
     gsub(regex,replacement,string)   
     match(string,regex)           
     tolower(string)               
     toupper(string) 
     system(cmd-line)
             Execute the command cmd-line, and return the exit status             
    
Example
The string function gsub to replace each occurrence of 286 with the string AT
awk '{ gsub( /286/, "AT" ); print $0 }' myfile

awk '{print tolower($0)}' myfile

If myfile contains:
50.211 14.979 24.196
50.142 15.162 25.415

awk '{split($0,a," "); print a[1]}' myfile
will give :
50.211
50.142

If I do this only on line 1:
awk 'NR == 1 {split($0,a," "); print a[1]}' myfile
I get:
50.211


If myfile contains:

Processing NGC 2345
awk '{print substr($0,12,8)}' myfile 
will give: NGC 2345

The "split()" function has the syntax:
   split(<string>,<array>,[<field separator>])

 This function takes a string with n fields and stores the fields into array[1], array[2], ... , array[n]. If the optional field separator is not specified, the value of FS (normally "white space", the space and tab characters) is used. For example, suppose we have a field of the form:
   joe:frank:harry:bill:bob:sil

 We could use "split()" to break it up and print the names as follows:
   my_string = "joe:frank:harry:bill:bob:sil";
   split(my_string,names,":");
   print names[1];
   print names[2];
   ...
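
Since split() returns the number of pieces, the names can also be printed in a loop instead of one by one. This is only a sketch; the function name list_names is invented for the example.

```shell
# Hypothetical helper: split a colon-separated string and print
# each piece with its array index.
list_names() {
    awk -v s="$1" 'BEGIN {
        n = split(s, names, ":")   # n is the number of pieces
        for (i = 1; i <= n; i++) print i, names[i]
    }'
}

list_names "joe:frank:harry:bill:bob:sil"
```

Looping from 1 to n keeps the names in their original order.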

 The "index()" function has the syntax:
   index(<target string>,<search string>)

 -- and returns the position at which the search string begins in the target string (remember, the initial position is "1"). For example:
   index("gorbachev","bach")         returns:  4
   index("superficial","super")      returns:  1
   index("sunfire","fireball")       returns:  0
   index("aardvark","z")             returns:  0




match(string, regexp)
The match function searches the string, string, for the longest, leftmost substring matched by the regular expression, regexp. It returns the character position, or index, of where that substring begins (1, if it starts at the beginning of string). If no match is found, it returns 0. The match function sets the built-in variable RSTART to the index. It also sets the built-in variable RLENGTH to the length in characters of the matched substring. If no match is found, RSTART is set to 0, and RLENGTH to -1. For example:
awk '{
       if ($1 == "FIND")
         regex = $2
       else {
         where = match($0, regex)
         if (where)
           print "Match of", regex, "found at", where, "in", $0
       }
}'

This program looks for lines that match the regular expression stored in the variable regex. This regular expression can be changed. If the first word on a line is FIND, regex is changed to be the second word on that line. Therefore, given:
FIND fo*bar
My program was a foobar
But none of it would doobar
FIND Melvin
JF+KM
This line is property of The Reality Engineering Co.
This file created by Melvin.

awk prints:
Match of fo*bar found at 18 in My program was a foobar
Match of Melvin found at 22 in This file created by Melvin.
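
RSTART and RLENGTH can be combined with substr to pull out the matched text itself. A small sketch, with a made-up function name find_match:

```shell
# Hypothetical helper: report where a regexp matches in a string,
# how long the match is, and the matched text.
find_match() {
    awk -v s="$1" -v re="$2" 'BEGIN {
        if (match(s, re))
            print RSTART, RLENGTH, substr(s, RSTART, RLENGTH)
        else
            print "no match"
    }'
}

find_match "My program was a foobar" "fo*bar"
find_match "But none of it would doobar" "fo*bar"
```

The first call should report position 18 and length 6; the second has no match.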

split(string, array, fieldsep)
This divides string into pieces separated by fieldsep, and stores the pieces in array. The first piece is stored in array[1], the second piece in array[2], and so forth. The string value of the third argument, fieldsep, is a regexp describing where to split string (much as FS can be a regexp describing where to split input records). If the fieldsep is omitted, the value of FS is used. split returns the number of elements created. The split function, then, splits strings into pieces in a manner similar to the way input lines are split into fields. For example:
split("auto-da-fe", a, "-")

splits the string `auto-da-fe' into three fields using `-' as the separator. It sets the contents of the array a as follows:
a[1] = "auto"
a[2] = "da"
a[3] = "fe"

The value returned by this call to split is 3.  As with input field-splitting, when the value of fieldsep is " ", leading and trailing whitespace is ignored, and the elements are separated by runs of whitespace.
 sprintf(format, expression1,...)
This returns (without printing) the string that printf would have printed out with the same arguments. For example:
sprintf("pi = %.2f (approx.)", 22/7)

 returns the string "pi = 3.14 (approx.)".
 sub(regexp, replacement, target)
The sub function alters the value of target. It searches this value, which should be a string, for the leftmost substring matched by the regular expression, regexp, extending this match as far as possible. Then the entire string is changed by replacing the matched text with replacement. The modified string becomes the new value of target.  This function is peculiar because target is not simply used to compute a value, and not just any expression will do: it must be a variable, field or array reference, so that sub can store a modified value there. If this argument is omitted, then the default is to use and alter $0.  For example:
str = "water, water, everywhere"
sub(/at/, "ith", str)

sets str to "wither, water, everywhere", by replacing the leftmost, longest occurrence of 'at' with 'ith'.  The sub function returns the number of substitutions made (either one or zero).  If the special character '&' appears in replacement, it stands for the precise substring that was matched by regexp. (If the regexp can match more than one string, then this precise substring may vary.) For example:
awk '{ sub(/candidate/, "& and his wife"); print }'

changes the first occurrence of 'candidate' to 'candidate and his wife' on each input line.  Here is another example:
awk 'BEGIN {
        str = "daabaaa"
        sub(/a+/, "c&c", str)
        print str
}'

 prints 'dcaacbaaa'. This shows how '&' can represent a non-constant string, and also illustrates the "leftmost, longest" rule. (The pattern is /a+/ rather than /a*/ because /a*/ also matches the empty string at the very start of str, which would give 'ccdaabaaa'.) The effect of this special character ('&') can be turned off by putting a backslash before it in the string. As usual, to insert one backslash in the string, you must write two backslashes. Therefore, write '\\&' in a string constant to include a literal '&' in the replacement. For example, here is how to replace the first '|' on each line with an '&':
awk '{ sub(/\|/, "\\&"); print }'

Note: as mentioned above, the third argument to sub must be an lvalue. Some versions of awk allow the third argument to be an expression which is not an lvalue. In such a case, sub would still search for the pattern and return 0 or 1, but the result of the substitution (if any) would be thrown away because there is no place to put it. Such versions of awk accept expressions like this:
sub(/USA/, "United States", "the USA and Canada")

 But that is considered erroneous in gawk.
 gsub(regexp, replacement, target)
 This is similar to the sub function, except gsub replaces all of the longest, leftmost, nonoverlapping matching substrings it can find. The 'g' in gsub stands for "global," which means replace everywhere. For example:
awk '{ gsub(/Britain/, "United Kingdom"); print }'

 replaces all occurrences of the string 'Britain' with 'United Kingdom' for all input records. The gsub function returns the number of substitutions made. If the variable to be searched and altered, target, is omitted, then the entire input record, $0, is used. As in sub, the characters '&' and '\' are special, and the third argument must be an lvalue.
 substr(string, start, length)
 This returns a length-character-long substring of string, starting at character number start. The first character of a string is character number one. For example, substr("washington", 5, 3) returns "ing". If length is not present, this function returns the whole suffix of string that begins at character number start. For example, substr("washington", 5) returns "ington". This is also the case if length is greater than the number of characters remaining in the string, counting from character number start.
 tolower(string)
 This returns a copy of string, with each upper-case character in the string replaced with its corresponding lower-case character. Nonalphabetic characters are left unchanged. For example, tolower("MiXeD cAsE 123") returns "mixed case 123".
 toupper(string)
 This returns a copy of string, with each lower-case character in the string replaced with its corresponding upper-case character. Nonalphabetic characters are left unchanged. For example, toupper("MiXeD cAsE 123") returns "MIXED CASE 123".


The getline function
      getline                  
      getline <file             
      getline variable              
      getline variable <file    
awk provides the function getline to read input from the current input file or from a file or pipe.
getline reads the next input line and splits it into fields according to FS, setting $0, NF, NR and FNR. It returns 1 for success, 0 for end-of-file, and -1 on error.
The statement
        getline < "temp.dat"
reads the next input line from the file "temp.dat", field splitting is performed, and NF is set.

The statement
        getline data < "temp.dat"
reads the next input line from the file "temp.dat" into the user defined variable data, no field splitting is done, and NF, NR and FNR are not altered.
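
Because getline returns 1, 0 or -1, the usual idiom is a while loop that tests for a value greater than 0. The sketch below uses that idiom to count the lines of a file; the function name count_lines and the temporary file are invented for the example.

```shell
# Hypothetical helper: count the lines of a file with getline.
count_lines() {
    awk -v f="$1" 'BEGIN {
        # "> 0" stops the loop on end-of-file (0) and on error (-1)
        while ((getline line < f) > 0) n++
        close(f)
        print n + 0
    }'
}

tmp=$(mktemp)
printf 'one\ntwo\nthree\n' > "$tmp"
count_lines "$tmp"
rm -f "$tmp"
```

Testing only for a non-zero result would loop forever if the file could not be opened, since -1 counts as true.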

You can take input from the keyboard while running an awk script. Try the following:
      awk 'BEGIN {print "your name"; getline na <"-"; print "my name is " na}'
Here the getline function reads a line from the keyboard and assigns it to a variable.
Syntax:
getline variable-name < "-"
   |         |           |
   1         2           3

1 --> getline is the function name
2 --> variable-name receives the value read from the input
3 --> "-" means read from stdin (the keyboard)



Function Definition Example
Here is an example of a user-defined function, called myprint, that takes a number and prints it in a specific format.
function myprint(num)
{
     printf "%6.3g\n", num
}


 To illustrate, here is an awk rule which uses our myprint function:
$3 > 0     { myprint($3) }


 This program prints, in our special format, all the third fields that contain a positive number in our input. Therefore, when given:
 1.2   3.4    5.6   7.8
 9.10 11.12 -13.14 15.16
17.18 19.20  21.22 23.24


 this program, using our function to format the results, prints:
   5.6
  21.2


 Here is an example of a recursive function. It prints a string backwards:
function rev (str, len) {
    if (len == 0) {
        printf "\n"
        return
    }
    printf "%c", substr(str, len, 1)
    rev(str, len - 1)
}
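
To try the recursive function, it can be called from a BEGIN rule with the string and its length. A sketch, where the wrapper name reverse_string is made up:

```shell
# Hypothetical wrapper around the rev function shown above.
reverse_string() {
    awk -v s="$1" '
    function rev(str, len) {
        if (len == 0) {
            printf "\n"
            return
        }
        printf "%c", substr(str, len, 1)
        rev(str, len - 1)
    }
    BEGIN { rev(s, length(s)) }'
}

reverse_string "hello"
```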

Calling User-defined Functions
Calling a function means causing the function to run and do its job. A function call is an expression, and its value is the value returned by the function.
A function call consists of the function name followed by the arguments in parentheses. What you write in the call for the arguments are awk expressions; each time the call is executed, these expressions are evaluated, and the values are the actual arguments. For example, here is a call to foo with three arguments (the first being a string concatenation):
foo(x y, "lose", 4 * z)

Caution: whitespace characters (spaces and tabs) are not allowed between the function name and the open-parenthesis of the argument list. If you write whitespace by mistake, awk might think that you mean to concatenate a variable with an expression in parentheses. However, it notices that you used a function name and not a variable name, and reports an error.

When a function is called, it is given a copy of the values of its arguments. This is called call by value. The caller may use a variable as the expression for the argument, but the called function does not know this: it only knows what value the argument had. For example, if you write this code:
foo = "bar"
z = myfunc(foo)

then you should not think of the argument to myfunc as being "the variable foo." Instead, think of the argument as the string value, "bar".

If the function myfunc alters the values of its local variables, this has no effect on any other variables. In particular, if myfunc does this:
function myfunc (win) {
  print win
  win = "zzz"
  print win
}

to change its first argument variable win, this does not change the value of foo in the caller. The role of foo in calling myfunc ended when its value, "bar", was computed. If win also exists outside of myfunc, the function body cannot alter this outer value, because it is shadowed during the execution of myfunc and cannot be seen or changed from there.
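
A short sketch that shows call by value at work. The wrapper name call_by_value_demo is invented, and myfunc is changed slightly so that it returns its modified argument instead of printing it.

```shell
# Hypothetical demo: the function changes its parameter,
# but the caller's variable foo keeps its value.
call_by_value_demo() {
    awk '
    function myfunc(win) {
        win = "zzz"      # only the local copy changes
        return win
    }
    BEGIN {
        foo = "bar"
        z = myfunc(foo)
        print foo, z
    }'
}

call_by_value_demo
```

The output should be bar zzz: foo is untouched, z holds the value returned by the function.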

However, when arrays are the parameters to functions, they are not copied. Instead, the array itself is made available for direct manipulation by the function. This is usually called call by reference. Changes made to an array parameter inside the body of a function are visible outside that function.  This can be very dangerous if you do not watch what you are doing. For example:
function changeit (array, ind, nvalue) {
     array[ind] = nvalue
}

BEGIN {
           a[1] = 1 ; a[2] = 2 ; a[3] = 3
           changeit(a, 2, "two")
           printf "a[1] = %s, a[2] = %s, a[3] = %s\n", a[1], a[2], a[3]
      }


 prints 'a[1] = 1, a[2] = two, a[3] = 3', because calling changeit stores "two" in the second element of a.


The return Statement
The body of a user-defined function can contain a return statement. This statement returns control to the rest of the awk program. It can also be used to return a value for use in the rest of the awk program. It looks like this:
return expression


 The expression part is optional. If it is omitted, then the returned value is undefined and, therefore, unpredictable.

 A return statement with no value expression is assumed at the end of every function definition. So if control reaches the end of the function body, then the function returns an unpredictable value.  awk will not warn you if you use the return value of such a function; you will simply get unpredictable or unexpected results.

Here is an example of a user-defined function that returns a value for the largest number among the elements of an array:
function maxelt (vec,   i, ret) {
     for (i in vec) {
          if (ret == "" || vec[i] > ret)
               ret = vec[i]
     }
     return ret
}


You call maxelt with one argument, which is an array name. The local variables i and ret are not intended to be arguments; while there is nothing to stop you from passing two or three arguments to maxelt, the results would be strange. The extra space before i in the function parameter list is to indicate that i and ret are not supposed to be arguments. This is a convention which you should follow when you define functions.

Here is a program that uses our maxelt function. It loads an array, calls maxelt, and then reports the maximum number in that array:
awk '
function maxelt (vec,   i, ret) {
     for (i in vec) {
          if (ret == "" || vec[i] > ret)
               ret = vec[i]
     }
     return ret
}

# Load all fields of each record into nums.
{
          for(i = 1; i <= NF; i++)
               nums[NR, i] = $i
}

END {
     print maxelt(nums)
}'


 Given the following input:
 1 5 23 8 16
44 3 5 2 8 26
256 291 1396 2962 100
-6 467 998 1101
99385 11 0 225

the program tells us that:
99385
is the largest number in our array.



awk Control Flow Statements
if (  expression )  statement1 else  statement2

while (  expression )   statement

for (  expression1;  expression;  expression2 )  statement


 The syntax of "if ... else" is:
   if (<condition>) <action 1> [else <action 2>]

 The "else" clause is optional. The "condition" can be any expression discussed in the section on pattern matching, including matches with regular expressions.
For example, consider the following Awk program:
   {if ($1=="green") print "GO";
    else if ($1=="yellow") print "SLOW DOWN";
    else if ($1=="red") print "STOP";
    else print "WHAT";}


The syntax for "while" is:
   while (<condition>) <action>

 The "action" is performed as long as the "condition" tests true, and the "condition" is tested before each iteration. The conditions are the same as for the "if ... else" construct. For example, since by default an Awk variable has a value of 0, the following Awk program could print the numbers from 1 to 20:
   BEGIN {while(++x<=20) print x}

 The "for" loop is more flexible. It has the syntax:
   for (<initial action>;<condition>;<end-of-loop action>) <action>

 For example, the following "for" loop prints the numbers 10 through 20 in increments of 2:
   BEGIN {for (i=10; i<=20; i+=2) print i}

 This is equivalent to:
   i=10
   while (i<=20) {
      print i;
      i+=2;}

The "for" loop has an alternate syntax, used when scanning through an array:
   for (<variable> in <array>) <action>

 with the example:
   my_string = "joe:frank:harry:bill:bob:sil";
   split(my_string, names, ":");

 -- then the names could be printed with the following statement:
   for (idx in names) print idx, names[idx];

 This yields:
   2 frank
   3 harry
   4 bill
   5 bob
   6 sil
   1 joe

 Notice that the names are not printed in the proper order. One of the characteristics of this type of "for" loop is that the array is not scanned in a predictable order.


Awk defines four unconditional control statements: "break", "continue", "next", and "exit". "Break" and "continue" are strictly associated with the "while" and "for" loops:

    •      break: Causes a jump out of the loop.

    •      continue: Forces the next iteration of the loop.

 "Next" and "exit" control Awk's input scanning:

    •      next: Causes Awk to immediately get another line of input and begin scanning it from the first pattern.

    •      exit: Causes Awk to end reading its input and execute END operations,  if any are specified.
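
A small sketch of "next" and "exit" together. The function name copy_until_stop, the "#" comment convention and the STOP marker are all assumptions made up for this example.

```shell
# Hypothetical helper: skip comment lines, stop reading at a line
# whose first field is STOP, and report how many lines were printed.
copy_until_stop() {
    awk '
    /^#/         { next }      # go straight to the next input line
    $1 == "STOP" { exit }      # stop reading, but still run END
                 { print; n++ }
    END          { print n + 0, "lines printed" }'
}

printf '# header\nalpha\nbeta\nSTOP\ngamma\n' | copy_until_stop
```

This should print alpha and beta, then "2 lines printed". The line gamma is never read.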




Limits
Each implementation of awk imposes some limits. Below are limits typical of older implementations; modern versions such as gawk are generally far less restrictive:
        100 fields
        2500 characters per input line
        2500 characters per output line
        1024 characters per individual field
        1024 characters per printf string
        400 characters maximum quoted string
        400 characters in character class
        15 open files
        1 pipe




EXAMPLES
There are millions of different ways to do things... here are a few examples



The simplest action is to print some or all of a record; this is accomplished by the awk command print.

awk '{ print }' myfile
prints each record, while
awk '{print $2, $1}' myfile
prints the first two fields in reverse order, separated by the output field separator. In contrast,
awk '{print $1 $2}' myfile
concatenates the two fields with nothing between them.

awk '{ print $1 >"foo1"; print $2 >"foo2" }' myfile
writes the first field of each record to the file foo1 and the second field to the file foo2.


The variables OFS and ORS may be used to change the current output field separator and output record separator. The output record separator is appended to the output of the print statement.
Awk also provides the printf statement for output formatting.
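
A quick sketch of OFS and ORS. The function name join_fields and the separators chosen here are arbitrary.

```shell
# Hypothetical helper: print the first two fields joined by ":"
# and end each record with ";" followed by a newline.
join_fields() {
    awk 'BEGIN { OFS = ":"; ORS = ";\n" } { print $1, $2 }'
}

printf 'a b\nc d\n' | join_fields
```

The output should be a:b; and c:d; on two lines. Note that OFS is only used where print has a comma between its arguments.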

BEGIN and END
The special pattern BEGIN matches the beginning of the input, before the first record is read. The pattern END matches the end of the input, after the last record has been processed. BEGIN and END thus provide a way to gain control before and after processing.


-------

awk '{ if ($2 =="0.5") {print $0} }' myfile
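prints the lines for which field 2 is exactly the string 0.5 (a value written as 0.50 would not match; use $2 == 0.5 for a numeric comparison)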

Test count things:

awk 'BEGIN {counter = 0} {if ($2 == "0.5"){counter++}} END {print counter} '  myfile
this tells me how many times field 2 has a value of 0.5



Using Awk to create a simple histogram


We have scores in a file called mydata:
r   0.2     99
r   0.1     88
r   0.4     76
r   0.1     76
r   0.2     56
r   0.3     900
r   0.2     43
r   0.5     5
r   0.5     9
r   0.6     56
r   0.8     43
r   0.7     33
r   0.9     10



we can sort numerically on the second column with:

sort -k2,2n mydata > mydata_sorted
(older versions of sort used the now obsolete form: sort +1 -n)

this gives:
r   0.1     76
r   0.1     88
r   0.2     43
r   0.2     56
r   0.2     99
r   0.3     900
r   0.4     76
r   0.5     5
r   0.5     9
r   0.6     56
r   0.7     33
r   0.8     43
r   0.9     10


(sorting can be done in descending (reverse) order with sort -nr)




You can put the following lines in a file called histo.txt
to print a frequency histogram of the numbers in column (field) 2:
$2 <= 0.1 {na=na+1}
($2 > 0.1) && ($2 <= 0.2) {nb = nb+1}
($2 > 0.2) && ($2 <= 0.3) {nc = nc+1}
($2 > 0.3) && ($2 <= 0.4) {nd = nd+1}
($2 > 0.4) && ($2 <= 0.5) {ne = ne+1}
($2 > 0.5) && ($2 <= 0.6) {nf = nf+1}
($2 > 0.6) && ($2 <= 0.7) {ng = ng+1}
($2 > 0.7) && ($2 <= 0.8) {nh = nh+1}
($2 > 0.8) && ($2 <= 0.9) {ni = ni+1}
($2 > 0.9) {nj = nj+1}
END {print na, nb, nc, nd, ne, nf, ng, nh, ni, nj, NR}

and run

awk -f histo.txt mydata_sorted
this will give:
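2 3 1 1 2 1 1 1 1  13
meaning that 0.1 occurs twice, 0.2 three times, and so on up to 0.9, which occurs once. The last number, 13, is the total number of records. The count for values above 0.9 prints as an empty string because nj was never incremented.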
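
When the values are discrete, as here, an array indexed by the value itself can do the counting without one rule per bin. This is only a sketch of an alternative, not the method used above; the function name value_counts is made up.

```shell
# Hypothetical helper: count how often each value of field 2 occurs.
# The order of "for (v in count)" is unpredictable, so sort the result.
value_counts() {
    awk '{ count[$2]++ } END { for (v in count) print v, count[v] }' "$@" |
        LC_ALL=C sort -n
}

printf 'r 0.2 99\nr 0.1 88\nr 0.2 56\n' | value_counts
```

On this three-line sample the output should be "0.1 1" followed by "0.2 2". It can be run on the real file with: value_counts mydata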


-------


PDB file
Count residues in a PDB file (one CA atom per residue)
awk 'BEGIN{
counter = 0
}
{
if ($3 == "CA"){
counter++
}
}
END{
print counter
}' mypdb.pdb



To select lines with ATOM and CA and get the amino acid name
awk '$1 == "ATOM" && $3 == "CA" {print $4}' mypdb.pdb

or, letting grep do the CA selection (the first awk keeps only ATOM records, so HETATM records are dropped):

awk '$1 == "ATOM"' my.pdb | grep CA | awk '{print $4}' > myoutput.pdb


Get the sequence from a PDB file. Warning: nonstandard residue names are not translated and stay in the output as three-letter codes.
awk '$1 == "ATOM" && $3 == "CA" {print $4}' mypdb.pdb | awk ' { gsub( /VAL/, "V"); gsub( /GLY/, "G"); gsub( /ALA/, "A"); gsub( /LEU/, "L"); gsub( /ILE/, "I"); gsub( /SER/, "S"); gsub( /THR/, "T"); gsub( /ASP/, "D"); gsub( /ASN/, "N"); gsub( /LYS/, "K"); gsub( /GLU/, "E"); gsub( /GLN/, "Q"); gsub( /ARG/, "R"); gsub( /HIS/, "H"); gsub( /PHE/, "F"); gsub( /CYS/, "C"); gsub( /TRP/, "W"); gsub( /TYR/, "Y"); gsub( /MET/, "M"); gsub( /PRO/, "P"); residues = residues $1} END {print residues }' > myoutput.seq
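
The same translation can be sketched with a lookup table built once in BEGIN, which may be easier to read than twenty gsub calls. This is an alternative, not the command above: the function name three_to_one is invented, and printing X for an unknown residue is my own assumption.

```shell
# Hypothetical helper: read one three-letter residue name per line
# and print the one-letter sequence on a single line.
three_to_one() {
    awk 'BEGIN {
        n = split("ALA A ARG R ASN N ASP D CYS C GLN Q GLU E GLY G HIS H ILE I " \
                  "LEU L LYS K MET M PHE F PRO P SER S THR T TRP W TYR Y VAL V", t, " ")
        for (i = 1; i < n; i += 2) code[t[i]] = t[i + 1]
    }
    # Assumption: unknown residue names are written as X
    { seq = seq (($1 in code) ? code[$1] : "X") }
    END { print seq }'
}

printf 'MET\nALA\nGLY\nFOO\n' | three_to_one
```

On this sample the output should be MAGX. With a PDB file it would be used as: awk '$1 == "ATOM" && $3 == "CA" {print $4}' mypdb.pdb | three_to_one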

 



Distance between 2 atoms in a PDB file - test1 - I try something simple
if we have the x, y, z coordinates of 2 atoms in a file

    x         y            z
50.211 14.979 24.196   my atom x1
50.142 15.162 25.415   my atom x2


The distance between these 2 atoms is:
sqrt((x1-x2)^2 + (y1-y2)^2 + (z1-z2)^2)

awk '{printf "%s\t", $0}' mycoordinates > test1
puts all the coordinates on one row of several columns.

I get this in my file test1:

   $1          $2          $3          $4        $5         $6         
   x1          y1         z1           x2        y2         z2
50.211   14.979   24.196    50.142   15.162   25.415


awk '{ a=sqrt(($1-$4)^2 + ($2-$5)^2 + ($3-$6)^2); print a}' test1
result is 1.23459



Distance between 2 atoms in a PDB file - test2 
Now, if I have in my file test2:
50.211 14.979 24.196   my atom x1
50.142 15.162 25.415   my atom x2


I can, for example, work on field 1 only, that is, on the values
50.211
50.142

awk 'NR ==1 {split($0,a," ")}; NR == 2 {split($0,b," ")} END {print sqrt((a[1]-b[1])^2)}' test2


for the complete file test2:
awk 'NR ==1 {split($0,a," ")}; NR == 2 {split($0,b," ")} END {print sqrt((a[1]-b[1])^2 + (a[2]-b[2])^2 + (a[3]-b[3])^2)}' test2

the result is: 1.23459


With formatted output:
awk 'NR ==1 {split($0,a," ")}; NR == 2 {split($0,b," ")} END {printf "%.3f\n", sqrt((a[1]-b[1])^2 + (a[2]-b[2])^2 + (a[3]-b[3])^2)}' test2
the result is: 1.235


-------


Using getline

if in a file called molecule.txt i have:
ID 44000
jkjk
jkk
END

ID 400009
jkkk
mvnvbc
END

ID 58939
jkd
jkjd
END

ID 400009
jj
END

Thus four molecules, each ending with END and followed by an empty line (including one at the end of the file)

and a file called molecule_ID_numbers.txt with 4 lines and 4 numbers:
7888
9000
10000
15000



I can use getline var < file
This form of the getline function takes its input from the file molecule_ID_numbers.txt and puts it in the variable var. The following program copies molecule.txt to the output; each time it meets an empty line (the one that follows each END), it reads the next number from molecule_ID_numbers.txt and prints <CAS NUMBER> and that number in place of the empty line. The number of molecules should therefore equal the number of lines in molecule_ID_numbers.txt, and the empty line at the end of molecule.txt is important, since without it the last molecule gets no number.

awk '{ if (NF == 0) {getline var < "molecule_ID_numbers.txt" ; print "\n", "<CAS NUMBER>" "\n", var, "\n\n"} else print}' molecule.txt > myout


The output is in myout:
ID 44000
jkjk
jkk
END

 <CAS NUMBER>
 7888


ID 400009
jkkk
mvnvbc
END

 <CAS NUMBER>
 9000


ID 58939
jkd
jkjd
END

 <CAS NUMBER>
 10000


ID 400009
jj
END

 <CAS NUMBER>
 15000



-------

COUNTING STUFF

awk 'BEGIN{
counter = 0
}
{
if ($3 >= 0 && $3 <= 1 ){
counter++
}
}
END{
print "Scores_0_to_1    " counter
}' mylistwithnameshort_test.txt

awk 'BEGIN{
counter = 0
}
{
if ($3 >= 1 && $3 <= 2 ){
counter++
}
}
END{
print "Scores_1_to_2    " counter
}' mylistwithnameshort_test.txt

awk 'BEGIN{
counter = 0
}
{
if ($3 >= 2 && $3 <= 3 ){
counter++
}
}
END{
print "Scores_2_to_3    " counter
}' mylistwithnameshort_test.txt



awk 'BEGIN{
counter = 0
}
{
if ($3 >= 3 && $3 <= 4 ){
counter++
}
}
END{
print "Scores_3_to_4    " counter
}' mylistwithnameshort_test.txt


awk 'BEGIN{
counter = 0
}
{
if ($3 >= 4 && $3 <= 5 ){
counter++
}
}
END{
print "Scores_4_to_5    " counter
}' mylistwithnameshort_test.txt


awk 'BEGIN{
counter = 0
}
{
if ($3 >= 5 && $3 <= 6 ){
counter++
}
}
END{
print "Scores_5_to_6    " counter
}' mylistwithnameshort_test.txt



awk 'BEGIN{
counter = 0
}
{
if ($3 >= 6 && $3 <= 7 ){
counter++
}
}
END{
print "Scores_6_to_7    " counter
}' mylistwithnameshort_test.txt




awk 'BEGIN{
counter = 0
}
{
if ($3 >= 7 && $3 <= 8 ){
counter++
}
}
END{
print "Scores_7_to_8    " counter
}' mylistwithnameshort_test.txt




awk 'BEGIN{
counter = 0
}
{
if ($3 >= 8 && $3 <= 9 ){
counter++
}
}
END{
print "Scores_8_to_9    " counter
}' mylistwithnameshort_test.txt



awk 'BEGIN{
counter = 0
}
{
if ($3 >= 9 && $3 <= 10 ){
counter++
}
}
END{
print "Scores_9_to_10   " counter
}' mylistwithnameshort_test.txt




awk 'BEGIN{
counter = 0
}
{
if ($3 >= 10 && $3 <= 11 ){
counter++
}
}
END{
print "Scores_10_to_11  " counter
}' mylistwithnameshort_test.txt



awk 'BEGIN{
counter = 0
}
{
if ($3 >= 11 && $3 <= 12 ){
counter++
}
}
END{
print "Scores_11_to_12  " counter
}' mylistwithnameshort_test.txt



awk 'BEGIN{
counter = 0
}
{
if ($3 >= 12 && $3 <= 13 ){
counter++
}
}
END{
print "Scores_12_to_13  " counter 
}' mylistwithnameshort_test.txt



awk 'BEGIN{
counter = 0
}
{
if ($3 >= 13 && $3 <= 14 ){
counter++
}
}
END{
print "Scores_13_to_14  "     counter 
}' mylistwithnameshort_test.txt




awk 'BEGIN{
counter = 0
}
{
if ($3 >= 14 && $3 <= 15 ){
counter++
}
}
END{
print "Scores_14_to_15  "     counter 
}' mylistwithnameshort_test.txt



awk 'BEGIN{
counter = 0
}
{
if ($3 >= 15 && $3 <= 20 ){
counter++
}
}
END{
print "Scores_15_to_20  "     counter 
}' mylistwithnameshort_test.txt
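
The sixteen runs above can be sketched as a single pass over the file, with int($3) used as the bin number. This is an alternative under one stated assumption: bins are half-open (a score of exactly 1 goes to Scores_1_to_2 only), whereas the commands above count a score that falls exactly on a boundary in both neighbouring bins. The function name score_histogram is made up.

```shell
# Hypothetical helper: count the scores in field 3 per unit interval,
# plus one wide final bin from 15 to 20, in a single pass.
score_histogram() {
    awk '
    $3 >= 0 && $3 < 15   { bin[int($3)]++ }
    $3 >= 15 && $3 <= 20 { last++ }
    END {
        for (i = 0; i < 15; i++)
            printf "Scores_%d_to_%d %d\n", i, i + 1, bin[i]
        printf "Scores_15_to_20 %d\n", last
    }' "$@"
}

printf 'a b 0.5\na b 1.5\na b 1.7\na b 16\n' | score_histogram
```

On the real data it would be run as: score_histogram mylistwithnameshort_test.txt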


-------


echo "the script starts"
echo "check that each compound starts with ISIS"
echo -n "select the SDF file:"
read file1
echo "the name of the file is $file1"
tr '\r' '\n' < "$file1" > "tmp1_unix.sdf"
echo "step 1"
awk 'NF > 0' < "tmp1_unix.sdf" > "tmp2_unix_no_emptylines.sdf"
echo "this is done"


-------
Rename files
#!/bin/sh
# we have less than 3 arguments. Print the help text:
if [ $# -lt 3 ] ; then
cat <<HELP
ren -- renames a number of files using sed regular expressions

USAGE: ren 'regexp' 'replacement' files...

EXAMPLE: rename all *.HTM files to *.html:
  ren 'HTM$' 'html' *.HTM

HELP
  exit 0
fi
OLD="$1"
NEW="$2"
# The shift command removes one argument from the list of
# command line arguments.
shift
shift
# "$@" now contains all the files (quoted, so names with spaces survive):
for file in "$@"; do
    if [ -f "$file" ] ; then
      newfile=`echo "$file" | sed "s/${OLD}/${NEW}/g"`
      if [ -f "$newfile" ]; then
        echo "ERROR: $newfile exists already"
      else
        echo "renaming $file to $newfile ..."
        mv "$file" "$newfile"
      fi
    fi
done


Rename files (add a .smi extension)
# This only prints the mv commands; remove "echo" to actually rename.
for i in toto* ; do
    echo mv "$i" "$i.smi"
done



-------

Some other examples

# Print first two fields in opposite order:
  awk '{ print $2, $1 }' file


# Print lines longer than 72 characters:
  awk 'length > 72' file


# Print length of string in 2nd column
  awk '{print length($2)}' file


# Add up first column, print sum and average:
       { s += $1 }
  END  { print "sum is", s, " average is", s/NR }


# Print fields in reverse order:
  awk '{ for (i = NF; i > 0; --i) print $i }' file


# Print the last line
      {line = $0}
  END {print line}


# Print the total number of lines that contain the word Pat
  /Pat/ {nlines = nlines + 1}
  END {print nlines}


# Print all lines between start/stop pairs:
  awk '/start/, /stop/' file


# Print all lines whose first field is different from previous one:
  awk '$1 != prev { print; prev = $1 }' file


# Print column 3 if column 1 > column 2:
  awk '$1 > $2 {print $3}' file


# Print line if column 3 > column 2:
  awk '$3 > $2' file


# Count number of lines where col 3 > col 1
  awk '$3 > $1 { n++ } END { print n + 0 }' file


# Print sequence number and then column 1 of file:
  awk '{print NR, $1}' file


# Print every line after erasing the 2nd field
  awk '{$2 = ""; print}' file


# Print hi 28 times
  yes | head -28 | awk '{ print "hi" }'


# Print hi.0010 to hi.0099 (NOTE IRAF USERS!)
  yes | head -90 | awk '{printf("hi.%04d\n", NR+9)}'

# Print out 4 random numbers between 0 and 1
yes | head -4 | awk '{print rand()}'

# Print out 40 random integers modulo 5
yes | head -40 | awk '{print int(100*rand()) % 5}'


# Replace every field by its absolute value
  { for (i = 1; i <= NF; i++) if ($i < 0) $i = -$i; print }

# If you have another character that delimits fields, use the -F option
# For example, to print out the phone number for Jones in the following file,
# 000902|Beavis|Theodore|333-242-2222|149092
# 000901|Jones|Bill|532-382-0342|234023
# ...
# type
  awk -F"|" '$2=="Jones"{print $4}' filename



# Some looping commands
# Remove a bunch of print jobs from the queue
  BEGIN{
    for (i=875;i>833;i--){
        printf "lprm -Plw %d\n", i
    } exit
       }


 Formatted printouts are of the form printf("format\n", value1, value2, ... valueN)
        e.g. printf("howdy %-8s What it is bro. %.2f\n", $1, $2*$3)
    %s    = string
    %-8s  = 8 character string left justified
    %.2f  = number with 2 places after .
    %6.2f = field 6 chars with 2 chars after .
    \n is newline
    \t is a tab
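
A tiny sketch that puts three of these formats side by side. The function name format_demo and the values are made up; the "|" characters are there only to make the field widths visible.

```shell
# Hypothetical demo of %-8s, %6.2f and %.2f.
format_demo() {
    awk 'BEGIN { printf("%-8s|%6.2f|%.2f\n", "abc", 3.14159, 2.5) }'
}

format_demo
```

The output should be "abc     |  3.14|2.50": the string padded on the right to 8 characters, the first number right-aligned in 6 characters, and the second number with two decimals and no padding.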