Awk...short intro
Oct 13, 2006
Awk -- A Pattern Scanning and Processing Language
There are three variations of AWK:
AWK - the original from AT&T
NAWK - A newer, improved version from AT&T
GAWK - The Free Software Foundation's version
Awk was created in the late 1970s. The original book describing it is Alfred V. Aho, Brian W.
Kernighan, and Peter J. Weinberger, The AWK Programming Language,
Addison-Wesley, 1988.
Some reasons to use Awk rather than Perl or other languages:
- awk is simpler
- awk syntax is far more regular
- you may already know awk well enough for the task at hand
- awk can be smaller and much quicker to execute for small programs
- yet, it is not suited for everything, as you may guess
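NB: converting line endings. To convert a classic Mac OS file (CR line endings) to Unix (LF line endings):
tr '\r' '\n' < macfile.txt > unixfile.txt
or
awk 'BEGIN { RS = "\r" } { print }' macfile.txt > unixfile.txt
To convert a Unix file to classic Mac OS using awk, at the command line, enter:
awk 'BEGIN { ORS = "\r" } { print }' unixfile.txt > macfile.txt
(Calling gsub("\n", "\r") cannot work here: awk removes the newline from each record before the action runs, so the output record separator ORS has to be changed instead.)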
Delete BOTH leading and trailing whitespace from each line
sed 's/^[ \t]*//;s/[ \t]*$//' file
(\t is understood by GNU sed; with other versions of sed, type a literal tab instead)
Simple sed for html
sed -e 's/target="_blank">http.*</target="_blank">LINK<\/a></g' my.html > my2.html
This runs with sed on a Mac; sed can behave differently on other operating systems, so test it before reusing it.
------AWK------
An awk program is a sequence of statements of the form:
pattern { action }
pattern { action }
Each line of input is matched against each of the patterns. For each pattern that matches, the associated action is executed. When all the patterns have been tested, the next line is fetched and the matching starts over.
Examples
awk 'NR == 1 {print $2}' mydatafile
This selects line one and runs the action {...}; here it prints field 2
awk 'length > 10' myfile
This prints every line that is longer than 10 characters
awk '{ $1 = log($1); print }' myfile
Replaces the first field of each line by its logarithm
awk '$2 ~ /A|B|C/' myfile
Prints all input lines with an A, B, or C in the second field
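awk '$2 ~ /0\.1/' myfile > myoutput
Prints all lines with 0.1 in the second field and copies them to the file myoutput (the dot is escaped so that it matches only a literal ".")
awk '{print "end"; print $0}' myfile
Prints a line containing "end" before each input line
awk '{print "$" $0}' myfile
Prints $ in front of each line
sed 's/\$/"/'
Substitutes the first $ on each line with " (add the g flag to replace every $)
sed 's/> <ID>/> <ID_STRUCTURE>/g' file_input.sdf > fileoutput.sdf
Substitutes each > <ID> with > <ID_STRUCTURE>, for LigandInfo for example (< and > must not be backslash-escaped: in GNU sed, \< and \> are word boundaries)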
This prints the last field of each line
awk '{ print $NF }' myfile
Print the last field of the last line
awk '{ field = $NF }; END{ print field }' myfile
Print every line with more than 4 fields
awk 'NF > 4' myfile
Print every line where the value of the last field is > 4
awk '$NF > 4' myfile
Some other examples
awk 'BEGIN {i=1; while (i<=10){ print i*i; i++}}'
awk 'BEGIN {col = 13; {print col}}'
awk 'BEGIN {lines=0} {lines++} END {print lines}' myfile
(roughly equivalent to wc -l)
Change the field separator:
If myfile contains:
uuuu:kkkk:lllll5676
uuuu:kkkk:lllll8999
uuuu:kkkk:lllll00999
awk 'BEGIN {FS=":"} {print $2}' myfile
This forces the separator to be : and prints field 2
awk '$3 == 0 {print $1}' myfile1 myfile2
For every line of the two files where field 3 equals 0, prints field 1
awk '$2 > 0.5 {col = col +1} END {print col}'
Prints the number of lines whose field 2 is above 0.5
Print lines with the word "brian" in them:
awk '/brian/ { print $0 }' myfile
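Print each input line preceded by its line number, with a heading that includes the name of the file
awk 'FNR == 1 { print "File:", FILENAME } { print NR, ":\t", $0 }' myfile
(FILENAME is not yet set inside a BEGIN rule, so the heading is printed when the first line of the file is read)
awk '{name = name $1} END {print name}'
Concatenates the first field of every line and prints the result on one line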
Insert 5 blank spaces at beginning of each line
awk '{sub(/^/, "     ");print}'
Substitute "foo" with "bar" EXCEPT for lines which contain "baz"
awk '!/baz/{gsub(/foo/, "bar")};{print}'
Change "scarlet" or "ruby" or "puce" to "red"
awk '{gsub(/scarlet|ruby|puce/, "red"); print}'
Remove duplicate, consecutive lines (emulates "uniq")
awk 'a !~ $0; {a=$0}'
Remove duplicate, nonconsecutive lines
awk '! a[$0]++' # most concise script
awk '!($0 in a) {a[$0];print}' # most efficient script
Print the first line
awk 'NR <2' test
Print the last line of a file (emulates "tail -1")
awk 'END{print}'
Print only lines which match regular expression (emulates "grep")
awk '/regex/'
Print only lines which do NOT match regex (emulates "grep -v")
awk '!/regex/'
Print section of file from regular expression to end of file
awk '/regex/,0'
awk '/regex/,EOF'
Print section of file based on line numbers (lines 8-12, inclusive)
awk 'NR==8,NR==12'
Print line number 52
awk 'NR==52'
awk 'NR==52 {print;exit}' # more efficient on large files
Print section of file between two regular expressions (inclusive)
awk '/Iowa/,/Montana/' # case sensitive
Delete ALL blank lines from a file (same as "grep '.' ")
awk NF myfile > myoutput
awk '/./'
awk '{n=5 ; print $n}' myinput
prints the fifth field in the input record
To list a file, but skip over all the blank lines at the start of the file, use the command:
awk "/[^ ]/ { copy=1 }; copy { print }" filename.ext
To list all the lines of a file (TMP.LOG), except those containing the string "Frame overlap", you can use the command:
awk '!/Frame overlap/' TMP.LOG
Adding a blank line after lines in a list
To add a blank line after all lines containing "util.h":
awk '{ print $0; if ( $0 ~ /util\.h/ ) print "" }' TMP.TMP
(In a Unix shell the program should be in single quotes: inside double quotes the shell would expand $0 itself, and an interactive bash may treat ! as history expansion.)
pattern: none, so the action runs for all lines
action: print $0; if ( $0 ~ /util\.h/ ) print ""
prints the line, then, if the line contains "util.h", prints a blank line
to add a blank line at the end of a file:
awk '{print $0} END {print ""}' myfile.txt > myoutput.txt
sed '
/$$$$/ {
N
/\n.*ISIS/ {
s/$$$$.*\n.*ISIS/$$$$ ISIS/
}
}'
awk 'BEGIN {for (x=1; x<=50; ++x) {printf("%3d\n",x) >> "tfile"}}'
dumps the numbers from 1 to 50 into "tfile".
Output can also be "piped" into another utility with the "|"
("pipe") operator. One can pipe output to the "tr" ("translate")
utility to convert it to upper-case:
awk 'BEGIN { print "this is a test"}' | tr "[a-z]" "[A-Z]"
yields:
THIS IS A TEST
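Awk can also open a pipe itself: print ... | "command" sends the output to the standard input of that command. The lines below are a minimal sketch, assuming a POSIX awk and the standard sort utility; the three input words are invented sample data.

```shell
# Sketch: pipe from inside awk with  print ... | "command".
# The input words are made-up sample data.
result=$(printf 'pear\napple\nfig\n' |
  awk '{ print toupper($1) | "sort" } END { close("sort") }')
printf '%s\n' "$result"
```

Closing the pipe with close("sort") in the END rule makes awk wait for sort to finish before it exits.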
HOW TO RUN AWK
- You can type: awk '{.....}' myinputfile > myoutputfile
(or use >> myoutputfile to append to an existing file)
- You can put the script in a file and run: awk -f myscript myfile
There are many other ways. Quoting rules depend on the shell you are using (sh, csh, tcsh, bash...), so check how yours behaves.
-F fs
Sets the FS variable (the input field separator) to fs.
-f source-file
Indicates that the awk program is to be found in source-file instead of in the first non-option argument.
-v var=val
Sets the variable var to the value val before execution of the
program begins. Such variable values are available inside the BEGIN
rule. The `-v' option can
only set one variable, but you can use it more than once, setting
another variable each time, like this: `-v foo=1 -v bar=2'.
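As a small illustration of the options above, the following sketch combines -F and -v. The variable name min and the three input lines are hypothetical, chosen only for this example.

```shell
# Sketch: -F sets the field separator; -v presets a variable
# (the hypothetical "min"), which is already visible in BEGIN.
result=$(printf 'a:5\nb:12\nc:30\n' | awk -F: -v min=10 '
BEGIN     { print "threshold", min }
$2 >= min { print $1 }')
printf '%s\n' "$result"
```

The same program could live in a file and be run with awk -F: -v min=10 -f myscript.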
ESSENTIAL SYNTAX
Arithmetic
Operator   Type         Meaning
+          Arithmetic   Addition
-          Arithmetic   Subtraction
*          Arithmetic   Multiplication
/          Arithmetic   Division
%          Arithmetic   Modulo
++         Arithmetic   Increment
--         Arithmetic   Decrement
^          Arithmetic   Exponentiation
+=         Assignment   Plus equals
-=         Assignment   Minus equals
*=         Assignment   Multiply equals
/=         Assignment   Divide equals
%=         Assignment   Modulus equals
^=         Assignment   Exponential equals
awk 'BEGIN {OFS="\t"} {print $1,$2,$3,(($1+$2+$3)/3)}' IN > OUT
Prints column 1, column 2, column 3, and the mean of the 3 columns
awk '{print ($1/66),($2/6430),($3/627)}' IN>OUT
Divides each column by a different number
awk '{printf "%.3f\t%.3f\t%.3f\n", ($1/66),($2/6430),($3/627)}' IN>OUT
printf formats each value with 3 digits after the decimal point and separates the values with tabs
Conditional expressions
Operator   Meaning
==         Is equal to
!=         Is not equal to
>          Is greater than
>=         Is greater than or equal to
<          Is less than
<=         Is less than or equal to
Regular Expression Operators
Operator Meaning
~ Matches
!~ Doesn't match
ex: word !~ /START/
Logical operators
&& (AND), || (OR), ! (NOT)
Built-in VARIABLES
Records and Fields
Awk input is divided into records terminated by a record separator. The default record separator is a newline, so by default awk processes its input a line at a time. The number of the current record is available in a variable named NR.
Each input record is considered to be divided into fields. Fields are normally separated by white space (blanks or tabs), but the input field separator may be changed. Fields are referred to as $1, $2, and so forth, where $1 is the first field, and $0 is the whole input record itself. Fields may be assigned to. The number of fields in the current record is available in a variable named NF.
The variables FS and RS refer to the input field and record separators; they may be changed at any time to any single character. The optional command-line argument -Fc may also be used to set FS to the character c.
The variable FILENAME contains the name of the current input file.
print $0 prints the full line. If the line has 8 fields separated by single blanks, it is equivalent to:
print $1, $2, $3, $4, $5, $6, $7, $8
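The following sketch shows NR, NF, and field assignment working together, assuming a POSIX awk. The two input lines are invented; note that assigning to a field makes awk rebuild $0 with the output field separator OFS.

```shell
# Sketch: NR and NF per record; assigning to $1 rebuilds $0 using OFS.
# The two input lines are made-up sample data.
result=$(printf 'a b c\nd e\n' |
  awk 'BEGIN { OFS = "-" } { $1 = toupper($1); printf "%d %d %s\n", NR, NF, $0 }')
printf '%s\n' "$result"
```

Each output line holds the record number, the field count, and the rebuilt record.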
FS - The Input Field Separator
The input field separator, white space (blanks and tabs) by default
FNR - The Record Number in the Current File
The input record number in the current input file
OFS - The Output Field Separator
The output field separator, a blank by default
ORS - The Output Record Separator
The output record separator, by default a newline
NF - The Number of Fields
Awk counts the number of fields in the input line and puts it into a variable called NF.
awk '{print NF, $NF}' myinput
Prints the number of fields and the last field of each line
NR - The Number of Records - the current input line number
Awk counts the number of lines it reads
awk '{print NR, $0}'
This prints the line number and the complete line
RS - The Input Record Separator
The input record separator, by default a newline
Change the RS
awk 'BEGIN {
# change the record separator from newline to the empty string:
# records are then separated by blank lines
RS=""
# change the field separator from whitespace to newline
FS="\n"
}
{
# print the second and third line of each record
print $2, $3}' myfile
if myfile is:
50.211 14.979 24.196
50.142 15.162 25.415
awk 'BEGIN {RS=""; FS="\n"} {print $2}' myfile
gives:
50.142 15.162 25.415
Arrays
Awk provides one-dimensional arrays. Arrays need not
be declared; they are created on first use, in the same way as user-defined
variables.
Array subscripts can be numbers or strings.
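A common use of string subscripts is counting how often each value occurs. This is a minimal sketch, assuming a POSIX awk; the array name count and the input words are made up for the example.

```shell
# Sketch: an array indexed by strings, created on first use.
# "count" and the input words are hypothetical.
result=$(printf 'red\nblue\nred\n' |
  awk '{ count[$1]++ }
       END { print "red", count["red"]; print "blue", count["blue"] }')
printf '%s\n' "$result"
```

A loop of the form for (key in count) visits every subscript, but in no guaranteed order, which is why the sketch prints the two keys explicitly.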
Length
this counts the number of characters in a string
awk '{print length($0)}' myfile
Print and Printf
The print statement does output with simple, standardized formatting.
You specify only the strings or numbers to be printed, in a list
separated by commas. They are output, separated by single spaces,
followed by a newline. The statement looks like this:
print item1 , item2 , ...
The simple statement print with no items is equivalent to print $0:
it prints the entire current record. To print a blank line, use 'print ""', where "" is the null, or empty, string.
Using printf Statements for Fancier Printing
A format specifier starts with the character % and ends with a
format-control letter; it tells the printf statement how to output one
item. The
format-control letter specifies what kind of value to print. The rest
of the format specifier is made up of optional modifiers which are
parameters such as the field width to use.
Here is a list of the format-control letters:
c This prints a number as an ASCII character. Thus, 'printf "%c", 65'
outputs the letter A. The output for a string value is the first
character of the string.
d This prints a decimal integer.
i This also prints a decimal integer.
e This prints a number in scientific (exponential) notation. For example,
printf "%4.3e", 1950
prints 1.950e+03, with a total of four significant figures of
which three follow the decimal point. The 4.3 are modifiers,
discussed below.
f This prints a number in floating point notation.
g This prints a number in either scientific
notation or floating point notation, whichever uses fewer characters.
o This prints an unsigned octal integer.
s This prints a string.
x This prints an unsigned hexadecimal integer.
% This isn't really a format-control letter, but it does have a meaning
when used after a %: the sequence `%%' outputs one '%'. It does not
consume an argument.
A format specification can also include modifiers that can
control how much of the item's value is printed and how much space it
gets. The modifiers come between the '%' and the format-control letter.
Here are the possible modifiers, in the order in which they may appear:
'-'
The minus sign, used before the width modifier, says to left-justify
the argument within its specified width. Normally the argument is
printed right-justified in the specified width. Thus,
printf "%-4s", "foo"
prints 'foo '.
'width'
This is a number representing the desired width of a field. Inserting
any number between the '%' sign and the format control character forces
the field to be expanded to this width. The default way to do this is
to pad with spaces on the left. For example,
printf "%4s", "foo"
prints ' foo'. The value of width is a minimum width, not a
maximum. If the item value requires more than width characters, it can
be as wide as necessary. Thus,
printf "%4s", "foobar"
prints 'foobar'. Preceding the width with a minus sign
causes the output to be padded with spaces on the right, instead of on
the left.
'.prec'
This is a number that specifies the precision to use when printing.
This specifies the number of digits you want printed to the right of
the decimal point. For a string, it specifies the maximum number of
characters from the string that should be printed.
gawk supports the C library printf's dynamic width and prec capability (for
example, "%*.*s"); other versions of awk may not. Instead of supplying explicit width
and/or prec values in the format string, you pass them in the argument
list. For example:
w = 5
p = 3
s = "abcdefg"
printf "<%*.*s>\n", w, p, s
is exactly equivalent to
s = "abcdefg"
printf "<%5.3s>\n", s
Both programs output '<**abc>'. (The symbol "*" is used here to
represent a space, to show clearly that there are two
spaces in the output.)
PRINT AND PRINTF again
The simplest output statement is the by-now familiar "print" statement. There's not too much to it:
• "Print" by itself prints the input line.
• "Print" with one argument prints the argument.
• "Print" with multiple arguments prints all the arguments, separated by spaces (or other specified OFS) when the arguments are separated by commas, or concatenated when the arguments are separated by spaces.
* The "printf()" (formatted print) function is much more flexible, and trickier. It has the syntax:
printf(<string>,<expression list>)
The "string" can be a normal string of characters:
printf("Hi, there!")
This prints "Hi, there!" to the display, just like "print" would,
with one slight difference: the cursor remains at the end of the text,
instead of skipping to the next line, as it would with "print". A
"newline" code ("\n") has to be added to force "printf()" to skip to
the next line:
printf("Hi, there!\n")
So far, "printf()" looks like a step backward from "print", and
if you use it to do dumb things like this, it is. However, "printf()"
is useful when you want precise control over the appearance of the
output.
The trick is that the string can contain format or "conversion"
codes to control the results of the expressions in the expression list.
For example, the following program:
BEGIN {x = 35; printf("x = %d decimal, %x hex, %o octal.\n",x,x,x)}
-- prints:
x = 35 decimal, 23 hex, 43 octal.
The format codes in this example include: "%d" (specifying
decimal output), "%x" (specifying hexadecimal output), and "%o"
(specifying octal output). The "printf()" function substitutes the
three variables in the expression list for these format codes on output.
* The format codes are highly flexible and their use can be a bit
confusing. The "d" format code prints a number in decimal format. The
output is an integer, even if the number is a real, like 3.14159.
Trying to print a string with this format code results in a "0" output.
For example:
x = 35; printf("x = %d\n",x) yields: x = 35
x = 3.1415; printf("x = %d\n",x) yields: x = 3
x = "TEST"; printf("x = %d\n",x) yields: x = 0
* The "o" format code prints a number in octal format. Other than
that, this format code behaves exactly as does the "%d" format
specifier. For example:
awk 'BEGIN {x = 255; printf("x = %o\n",x)}' yields: x = 377
* The "x" format code prints a number in hexadecimal format.
Other than that, this format code behaves exactly as does the "%d"
format specifier. For example:
x = 197; printf("x = %x\n",x) yields: x = c5
* The "c" format code prints a character, given its numeric code.
For example, the following statement outputs all the printable
characters:
BEGIN {for (ch=32; ch<128; ch++) printf("%c %c\n",ch,ch+128)}
* The "s" format code prints a string. For example:
x = "jive"; printf("string = %s\n",x) yields: string = jive
* The "e" format code prints a number in exponential format, in the default format:
[-]D.DDDDDDe[+/-]DD
For example:
x = 3.1415; printf("x = %e\n",x) yields: x = 3.141500e+00
(The exponent has at least two digits; some DOS and Windows versions print three, as in 3.141500e+000.)
* The "f" format code prints a number in floating-point format, in the default format:
[-]D.DDDDDD
For example:
x = 3.1415; printf("x = %f\n",x) yields: x = 3.141500
* The "g" format code prints a number in exponential or floating-point format, whichever is shortest.
* A numeric string may be inserted between the "%" and the format
code to specify greater control over the output format. For example:
%3d
%5.2f
%08s
%-8.4s
This works as follows:
• The integer part of the number specifies the minimum "width", or number of spaces, the output will use, though the output may exceed that width if it is too long to fit.
• The fractional part of the number specifies either, for a string, the maximum number of characters to be printed; or, for floating-point formats, the number of digits to be printed to the right of the decimal point.
• A leading "-" specifies left-justified output. The default is right-justified output.
• A leading "0" specifies that the output be padded with leading zeroes to fill up the output field. The default is spaces.
For example, consider the output of a string:
x = "Baryshnikov"
printf("[%3s]\n",x) yields: [Baryshnikov]
printf("[%16s]\n",x) yields: [     Baryshnikov]
printf("[%-16s]\n",x) yields: [Baryshnikov     ]
printf("[%.3s]\n",x) yields: [Bar]
printf("[%16.3s]\n",x) yields: [             Bar]
printf("[%-16.3s]\n",x) yields: [Bar             ]
printf("[%016s]\n",x) yields: [00000Baryshnikov]
printf("[%-016s]\n",x) yields: [Baryshnikov     ]
(Zero padding of strings is not portable: gawk documents the "0" flag as applying only to numeric formats and pads strings with spaces.)
-- or an integer:
x = 312
printf("[%2d]\n",x) yields: [312]
printf("[%8d]\n",x) yields: [     312]
printf("[%-8d]\n",x) yields: [312     ]
printf("[%.1d]\n",x) yields: [312]
printf("[%08d]\n",x) yields: [00000312]
printf("[%-08d]\n",x) yields: [312     ]
-- or a floating-point number:
x = 251.673209
printf("[%2f]\n",x) yields: [251.673209]
printf("[%16f]\n",x) yields: [      251.673209]
printf("[%-16f]\n",x) yields: [251.673209      ]
printf("[%.3f]\n",x) yields: [251.673]
printf("[%16.3f]\n",x) yields: [         251.673]
printf("[%016.3f]\n",x) yields: [000000000251.673]
----------
The keywords BEGIN and END are used to perform specific actions before and after reading the input lines. The BEGIN keyword is normally associated with printing titles and setting default values, whilst the END keyword is normally associated with printing totals.
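A minimal sketch of that typical use, assuming a POSIX awk: BEGIN prints a title and sets a starting value, the main rule accumulates, and END prints the total. The variable name total and the three input numbers are invented for the example.

```shell
# Sketch: BEGIN runs before any input, END after the last line.
# "total" and the input numbers are hypothetical.
result=$(printf '3\n4\n5\n' | awk '
BEGIN { print "values"; total = 0 }
      { total += $1 }
END   { print "total:", total }')
printf '%s\n' "$result"
```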
awk 'BEGIN {string = "Super" "power"; print string}'
this will print: Superpower
The substr function extracts part of a string. For example, to extract and print the word "get" from "unforgettable":
BEGIN {print substr("unforgettable",6,3)}
Please be aware that the first character of the string is
numbered "1", not "0". To extract a substring of at most ten
characters, starting from position 6 of the first field variable, you
use:
substr($1,6,10)
----------
Escape sequences
Sequence   Description
\b         Backspace
\f         Formfeed
\n         Newline
\r         Carriage return
\t         Horizontal tab
\"         Double quote
\a         The "alert" character; usually the ASCII BEL character
\v         Vertical tab
example: awk '{ print $0 "\n"}' myfile
adds an empty line after each line
Regular Expressions
Pattern searching similar to grep and other unix utilities:
/386/
$1 ~ /386/
In regular expressions, the following symbols are metacharacters with special meanings.
\ ^ $
. [ ] * + ? ( ) |
^ matches the beginning of a string
$ matches the end of a string
. matches any single character
[ ] defines a set of characters
( ) used for grouping
| specifies alternatives
Displays all lines whose first field does not contain 2, 3, 4, 6 or 8
awk '$1 !~ /[23468]/ { print $0 }'
How to hide special characters from the shell (this depends on the shell!)
Preceding any single character with a backslash ('\') quotes that character.
Thus:
awk "BEGIN { print \"Don't Panic!\" }"
With tcsh you get:
tcsh: Unmatched '
but with bash it works.
With tcsh you need to write this:
awk 'BEGIN { print "Here is a single quote '\''" }'
The result is:
Here is a single quote '
Regular expressions are the extended kind found in egrep. They are composed of characters as follows:
c           matches the character c (assuming c is a character with no special meaning in regexps).
\c          matches the literal character c.
.           matches any character except newline.
^           matches the beginning of a line or a string.
$           matches the end of a line or a string.
[abc...]    matches any of the characters abc... (character class).
[^abc...]   matches any character except abc... and newline (negated character class).
r1|r2       matches either r1 or r2 (alternation).
r1r2        matches r1, and then r2 (concatenation).
r+          matches one or more r's.
r*          matches zero or more r's.
r?          matches zero or one r's.
(r)         matches r (grouping).
* The simplest kind of search pattern that can be specified is a simple string, enclosed in forward-slashes ("/"). For example:
/The/
-- searches for any line that contains the string "The". This
will not match "the" as Awk is "case-sensitive", but it will match
words like "There" or "Them".
This is the crudest sort of search pattern. Awk defines special
characters or "metacharacters" that can be used to make the search more
specific. For example, preceding the string with a "^" tells Awk to
search for the string at the beginning of the input line. For example:
/^The/
-- matches any line that begins with the string "The". Similarly,
following the string with a "$" matches any line that ends with "The",
for example:
/The$/
But what if you actually want to search the text for a character
like "^" or "$"? Simple, just precede the character with a backslash
("\"). For example:
/\$/
-- matches any line with a "$" in it.
* Such a pattern-matching string is known as a "regular
expression". There are many different characters that can be used to
specify regular expressions. For example, it is possible to specify a
set of alternative characters using square brackets ("[]"):
/[Tt]he/
This example matches the strings "The" and "the". A range of characters can also be specified. For example:
/[a-z]/
-- matches any character from "a" to "z", and:
/[a-zA-Z0-9]/
-- matches any letter or number.
A range of characters can also be excluded, by preceding the range with a "^". For example:
/^[^a-zA-Z0-9]/
-- matches any line that doesn't start with a letter or digit.
A "|" allows regular expressions to be logically ORed. For example:
/(^Germany)|(^Netherlands)/
-- matches lines that start with the word "Germany" or the word
"Netherlands". Notice how parentheses are used to group the two
expressions.
* The "." special character allows "wildcard" matching, meaning
it can be used to specify any arbitrary character. For example:
/wh./
-- matches "who", "why", and any other string that has the characters "wh" and any following character.
Wildcards should be familiar to UN*X shell
users (the shell's "?" plays the role of "."), but awk interprets the "*" wildcard in a subtly different way.
In the UN*X shell, the "*" substitutes for a string of arbitrary
characters of any length, including zero, while in awk the "*" simply
matches zero or more repetitions of the previous character or
expression. For example, "a*" would match "a", "aa", "aaa", and so on, as well as the empty string.
That means that ".*" will match any string of characters.
There are other characters that allow matches against repeated
characters or expressions. A "?" matches zero or one occurrences of the
previous regular expression, while a "+" matches one or more
occurrences of the previous regular expression. For example:
/^[+-]?[0-9]+$/
-- matches any line that consists only of a (possibly signed)
integer number. This is a somewhat confusing example and it is helpful
to break it down by parts:
/^                 Find string at beginning of line.
/^[-+]?            Specify possible "-" or "+" sign for number.
/^[-+]?[0-9]+      Specify one or more digits "0" through "9".
/^[-+]?[0-9]+$/    Specify that the line ends with the number.
The search can be constrained to a single field within the input line. For example:
$1 ~ /^France$/
-- searches for lines whose first field ("$1" -- more on "field variables" later) is the word "France", while:
$1 !~ /^Norway$/
-- searches for lines whose first field is not the word "Norway".
It is possible to search for an entire series or "block" of
consecutive lines in the text, using one search pattern to match the
first line in the block and another search pattern to match the last
line in the block. For example:
/^Ireland/,/^Summary/
-- matches a block of text whose first line begins with "Ireland" and whose last line begins with "Summary".
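A minimal sketch of such a range pattern in use, assuming a POSIX awk; the five input lines are invented so that the block sits between unrelated lines.

```shell
# Sketch: a range pattern prints from the first matching line
# through the line matching the second pattern, inclusive.
# The input lines are made-up sample data.
result=$(printf 'intro\nIreland 1\nrow\nSummary x\nfooter\n' |
  awk '/^Ireland/,/^Summary/')
printf '%s\n' "$result"
```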
NF == 0
-- matches all blank lines, or those whose number of fields is zero.
$1 == "France"
-- is a string comparison that matches any line whose first field
is the string "France". The astute reader may notice that this example
seems to do the same thing as the previous example:
$1 ~ /^France$/
In fact, both examples do the same thing, but in the example
immediately above the "^" and "$" metacharacters had to be used in the
regular expression to specify a match with the entire first field;
without them, it would match such strings as "FranceFour", "NewFrance",
and so on. The string expression matches only to "France".
* It is also possible to combine several search patterns with the "&&" (AND) and "||" (OR) operators. For example:
((NR >= 30) && ($1 == "France")) || ($1 == "Norway")
-- matches any line from the 30th onward whose first field is "France", or any line whose first field is "Norway".
* One class of pattern-matching that wasn't listed above is
performing a numeric comparison on a field variable. It can be done, of
course; for example:
$1 == 100
-- matches any line whose first field has a numeric value equal
to 100. This is a simple thing to do and it will work fine. However,
suppose you want to perform:
$1 < 100
This will generally work fine, but there's a nasty catch to it,
which requires some explanation. The catch is that if the first field
of the input can be either a number or a text string, this sort of
numeric comparison can give crazy results, matching on some text
strings that aren't equivalent to a numeric value.
This is because awk is a "weakly-typed" language. Its variables
can store a number or a string, with awk performing operations on each
appropriately. In the case of the numeric comparison above, if $1
contains a numeric value, awk will perform a numeric comparison on it,
as expected; but if $1 contains a text string, awk will perform a text
comparison between the text string in $1 and the three-letter text
string "100". This will work fine for a simple test of equality or
inequality, since the numeric and string comparisons will give the same
results, but it will give crazy results for a "less than" or "greater
than" comparison.
Awk is not broken; it is doing what it is supposed to do in this
case. If this problem comes up, it is possible to add a second test to
the comparison to determine if the field contains a numeric value or a
text string. This second test has the form:
(( $1 + 0 ) == $1 )
If $1 contains a numeric value, the left-hand side of this
expression will add 0 to it, and awk will perform a numeric comparison
that will always be true.
If $1 contains a text string that doesn't look like a number, for
want of anything better to do awk will interpret its value as 0. This
means the left-hand side of the expression will evaluate to zero; since
there is a non-numeric text string in $1, awk will perform a string
comparison that will always be false. This leads to a more workable
comparison:
((( $1 + 0 ) == $1 ) && ( $1 > 100 ))
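The guard can be tried out on mixed input. This sketch applies it to the "$1 < 100" test mentioned earlier, assuming a POSIX awk; the four input lines are invented, two of them deliberately non-numeric.

```shell
# Sketch: keep only lines whose first field is numeric AND below 100.
# "abc" and "99x" fail the (($1 + 0) == $1) guard; 150 fails the comparison.
result=$(printf '50\nabc\n150\n99x\n' |
  awk '(($1 + 0) == $1) && ($1 < 100)')
printf '%s\n' "$result"
```

Without the guard, a field such as "99x" would be compared with 100 as a string and could slip through.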
AWK Numerical Functions
Name         Function
cos(x)       Cosine with x in radians
exp(x)       Exponential (e to the power x)
int(x)       Integer part of x truncated towards 0
log(x)       Natural logarithm of x
sin(x)       Sine with x in radians
sqrt(x)      Square root
atan2(y,x)   Arctangent of y/x in radians
rand()       Random number
srand(x)     Seed the random number generator
awk 'BEGIN { for (i = 1; i <= 7; i++) print int(101 * rand()) }'
This program prints 7 random numbers from 0 to 100, inclusive.
awk '{print sqrt($1)}' myfile
Print the square root for numbers in field 1
rand()
This gives you a random number. The values of rand are
uniformly-distributed between 0 and 1. The value can be 0 but is never
1. Often you want random integers instead. Here is a user-defined
function you can use to obtain a random nonnegative integer less than n:
function randint(n) {
return int(n * rand())
}
The multiplication produces a random real number greater than or equal to 0
and less than n. We then make it an integer (using int) between 0 and
n - 1. Here is an example where a similar function is used to
produce random integers between 1 and n. Note that this program will
print a new random number for each input record.
awk '
# Function to roll a simulated die.
function roll(n) { return 1 + int(rand() * n) }
# Roll 3 six-sided dice and print total number of points.
{
printf("%d points\n", roll(6)+roll(6)+roll(6))
}'
Note: rand starts generating numbers from the same point, or
seed, each time you run awk. This means that a program will produce the
same results each time you run it. The numbers are random within one
awk run, but predictable from run to run. This is convenient for
debugging, but if you want a program to do different things each time
it is used, you must change the seed to a value that will be different
in each run. To do this, use srand.
srand(x)
The function srand sets the starting point, or seed, for generating
random numbers to the value x. Each seed value leads to a
particular sequence of "random" numbers. Thus, if you set the seed to
the same value a second time, you will get the same sequence of
"random" numbers again. If you omit the argument x, as in
srand(), then the current date and time of day are used for a seed.
This is the way to get random numbers that differ from run
to run. The return value of srand is the previous seed.
This makes it easy to keep track of the seeds for use in consistently
reproducing sequences of random numbers.
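The seeding behaviour described above can be checked with a short sketch, assuming an awk whose srand returns the previous seed (gawk and mawk do). The seed 42 and the variable names are arbitrary choices for the example.

```shell
# Sketch: the same seed reproduces the same first random number,
# and srand() hands back the seed that was in effect before the call.
result=$(awk 'BEGIN {
    srand(42); a = rand()
    prev = srand(42); b = rand()
    same = (a == b) ? "same" : "different"
    print same, prev
}')
printf '%s\n' "$result"
```

Calling srand() with no argument at the start of a program seeds from the clock instead, so each run gives a different sequence.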
String Functions
index(string,search)
length(string)
split(string,array,separator)
substr(string,position)
substr(string,position,max)
sub(regex,replacement)
sub(regex,replacement,string)
gsub(regex,replacement)
gsub(regex,replacement,string)
match(string,regex)
tolower(string)
toupper(string)
system(cmd-line)
Execute the command cmd-line, and return the exit status
Example
The string function gsub replaces each occurrence of 286 with the string AT
awk '{ gsub( /286/, "AT" ); print $0 }' myfile
awk '{print tolower($0)}' myfile
If myfile contains:
50.211 14.979 24.196
50.142 15.162 25.415
awk '{split($0,a," "); print a[1]}' myfile
will give :
50.211
50.142
If I do this only on line 1:
awk 'NR == 1 {split($0,a," "); print a[1]}' myfile
I get:
50.211
If myfile contains:
Processing NGC 2345
awk '{print substr($0,12,8)}' myfile
will give: NGC 2345
The "split()" function has the syntax:
split(<string>,<array>,[<field separator>])
This function takes a string with n fields and stores the fields
into array[1], array[2], ... , array[n]. If the optional field
separator is not specified, the value of FS (normally "white space",
the space and tab characters) is used. For example, suppose we have a
field of the form:
joe:frank:harry:bill:bob:sil
We could use "split()" to break it up and print the names as follows:
my_string = "joe:frank:harry:bill:bob:sil";
split(my_string,names,":");
print names[1];
print names[2];
...
The "index()" function has the syntax:
index(<target string>,<search string>)
-- and returns the position at which the search string begins in
the target string (remember, the initial position is "1"). For example:
index("gorbachev","bach") returns: 4
index("superficial","super") returns: 1
index("sunfire","fireball") returns: 0
index("aardvark","z") returns: 0
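A short sketch of how index() is often combined with substr(): find the position of a separator, then cut the string on both sides of it. The function name split_pair and the sample input are made up for the illustration.

```shell
# Hypothetical helper: split each "key=value" line at the first "=".
split_pair() {
    awk '{
        pos = index($0, "=")            # 0 when there is no "=" in the line
        if (pos > 0)
            print substr($0, 1, pos - 1), "->", substr($0, pos + 1)
        else
            print "no separator in:", $0
    }'
}

printf 'color=green\nplain\n' | split_pair
```

With this input the first line should come out as "color -> green" and the second as "no separator in: plain".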
match(string, regexp)
The match function searches the string, string, for the longest,
leftmost substring matched by the regular expression, regexp. It
returns the character position, or index, of where that substring
begins (1, if it starts at the beginning of string). If no match is
found, it returns 0. The match function sets the built-in
variable RSTART to the index. It also sets the built-in variable
RLENGTH to the length in characters of the matched substring. If no
match is found, RSTART is set to 0, and RLENGTH to -1. For
example:
awk '{
    if ($1 == "FIND")
        regex = $2
    else {
        where = match($0, regex)
        if (where)
            print "Match of", regex, "found at", where, "in", $0
    }
}'
This program looks for lines that match the regular expression stored
in the variable regex. This regular expression can be changed. If the
first word on a line is FIND, regex is changed to be the second word
on that line. Therefore, given:
FIND fo*bar
My program was a foobar
But none of it would doobar
FIND Melvin
JF+KM
This line is property of The Reality Engineering Co.
This file created by Melvin.
awk prints:
Match of fo*bar found at 18 in My program was a foobar
Match of Melvin found at 22 in This file created by Melvin.
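RSTART and RLENGTH are most useful together with substr(), to pull out the text that matched. A minimal sketch, with a made-up function name (first_number) and sample lines:

```shell
# Hypothetical helper: print the first run of digits found on each line.
first_number() {
    awk '{
        if (match($0, /[0-9]+/))
            print substr($0, RSTART, RLENGTH), "at", RSTART, "length", RLENGTH
        else
            print "no number (RSTART=" RSTART ", RLENGTH=" RLENGTH ")"
    }'
}

printf 'Processing NGC 2345\nno digits here\n' | first_number
```

The first line should give "2345 at 16 length 4"; the second shows the values 0 and -1 that are set when nothing matches.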
split(string, array, fieldsep)
This divides string into pieces separated by fieldsep, and stores the
pieces in array. The first piece is stored in array[1], the second
piece in array[2], and so forth. The string value of the third
argument, fieldsep, is a regexp describing where to split string (much
as FS can be a regexp describing where to split input records). If the
fieldsep is omitted, the value of FS is used. split returns the number
of elements created. The split function, then, splits strings into
pieces in a manner similar to the way input lines are split into
fields. For example:
split("auto-da-fe", a, "-")
splits the string 'auto-da-fe' into three fields using '-' as the separator. It sets the contents of the array a as follows:
a[1] = "auto"
a[2] = "da"
a[3] = "fe"
The value returned by this call to split is 3. As with input
field-splitting, when the value of fieldsep is " ", leading and
trailing whitespace is ignored, and the elements are separated by runs
of whitespace.
sprintf(format, expression1,...)
This returns (without printing) the string that printf would have
printed out with the same arguments. For example:
sprintf("pi = %.2f (approx.)", 22/7)
returns the string "pi = 3.14 (approx.)".
sub(regexp, replacement, target)
The sub function alters the value of target. It searches this value,
which should be a string, for the leftmost substring matched by the
regular expression, regexp, extending this match as far as possible.
Then the entire string is changed by replacing the matched text with
replacement. The modified string becomes the new value of target.
This function is peculiar because target is not simply used to compute
a value, and not just any expression will do: it must be a variable,
field or array reference, so that sub can store a modified value there.
If this argument is omitted, then the default is to use and alter
$0. For example:
str = "water, water, everywhere"
sub(/at/, "ith", str)
sets str to "wither, water, everywhere", by replacing the leftmost,
longest occurrence of 'at' with 'ith'. The sub function returns
the number of substitutions made (either one or zero). If the
special character '&' appears in replacement, it stands for the
precise substring that was matched by regexp. (If the regexp can match
more than one string, then this precise substring may vary.) For
example:
awk '{ sub(/candidate/, "& and his wife"); print }'
changes the first occurrence of 'candidate' to 'candidate and his wife' on each input line. Here is another example:
awk 'BEGIN {
    str = "daabaaa"
    sub(/a+/, "c&c", str)
    print str
}'
prints 'dcaacbaaa'. This shows how '&' can represent a
non-constant string, and also illustrates the "leftmost, longest"
rule. The effect of this special character ('&') can be
turned off by putting a backslash before it in the string. As usual, to
insert one backslash in the string, you must write two backslashes.
Therefore, write '\\&' in a string constant to include a literal
'&' in the replacement. For example, here is how to replace the
first '|' on each line with an '&':
awk '{ sub(/\|/, "\\&"); print }'
Note: as mentioned above, the third argument to sub must be an
lvalue. Some versions of awk allow the third argument to be an
expression which is not an lvalue. In such a case, sub would still
search for the pattern and return 0 or 1, but the result of the
substitution (if any) would be thrown away because there is no place to
put it. Such versions of awk accept expressions like this:
sub(/USA/, "United States", "the USA and Canada")
But that is considered erroneous in gawk.
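The two uses of '&' described above can be tried side by side. This is only a sketch; the function name demo_sub and the input line are invented for the purpose.

```shell
# Hypothetical demo: "&" as the matched text, and "\\&" as a literal ampersand.
demo_sub() {
    awk '{
        sub(/candidate/, "[&]")     # & stands for the text that matched
        sub(/\|/, "\\&")            # \\& puts a literal ampersand in the result
        print
    }'
}

printf 'candidate A|B|C\n' | demo_sub
```

This should print "[candidate] A&B|C": only the first '|' is replaced, because sub stops after one substitution.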
gsub(regexp, replacement, target)
This is similar to the sub function, except gsub replaces all of
the longest, leftmost, nonoverlapping matching substrings it can find.
The 'g' in gsub stands for "global," which means replace everywhere.
For example:
awk '{ gsub(/Britain/, "United Kingdom"); print }'
replaces all occurrences of the string 'Britain' with 'United
Kingdom' for all input records. The gsub function returns the number of
substitutions made. If the variable to be searched and altered, target,
is omitted, then the entire input record, $0, is used. As in sub, the
characters '&' and '\' are special, and the third argument must be
an lvalue.
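Because gsub returns the number of substitutions, it can also be used to count matches. A small sketch (replace_and_count is a made-up name):

```shell
# Hypothetical helper: replace every "Britain" and report how many were replaced.
replace_and_count() {
    awk '{
        n = gsub(/Britain/, "United Kingdom")   # n = replacements made in $0
        print n ": " $0
    }'
}

printf 'Britain and Britain\nFrance\n' | replace_and_count
```

The first line should come out as "2: United Kingdom and United Kingdom" and the second as "0: France".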
substr(string, start, length)
This returns a length-character-long substring of string,
starting at character number start. The first character of a string is
character number one. For example, substr("washington", 5, 3) returns
"ing". If length is not present, this function returns the whole suffix
of string that begins at character number start. For example,
substr("washington", 5) returns "ington". This is also the case if
length is greater than the number of characters remaining in the
string, counting from character number start.
tolower(string)
This returns a copy of string, with each upper-case character in
the string replaced with its corresponding lower-case character.
Nonalphabetic characters are left unchanged. For example,
tolower("MiXeD cAsE 123") returns "mixed case 123".
toupper(string)
This returns a copy of string, with each lower-case character in
the string replaced with its corresponding upper-case character.
Nonalphabetic characters are left unchanged. For example,
toupper("MiXeD cAsE 123") returns "MIXED CASE 123".
The getline function
getline
getline <file
getline variable
getline variable <file
awk provides the function getline to read input from the current input file or from a file or pipe.
In its plain form, getline reads the next input line into $0, splits it
into fields according to FS, and sets NF, NR and FNR. It returns 1 for
success, 0 for end-of-file, and -1 on error.
The statement
getline < "temp.dat"
reads the next input line from the file "temp.dat" into $0; field splitting is performed and NF is set.
The statement
getline data < "temp.dat"
reads the next input line from the file "temp.dat" into the user
defined variable data, no field splitting is done, and NF, NR and FNR
are not altered.
You can take input from the keyboard while running an awk script. Try the following:
awk 'BEGIN {print "your name"; getline na <"-"; print "my name is " na}'
Here the getline function reads a line from the keyboard and assigns it to the variable na.
Syntax:
getline variable-name < "-"
   1          2          3
1 --> getline is the function name
2 --> variable-name is the variable that receives the line read from input
3 --> "-" means read from stdin (the keyboard)
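Since getline returns 1, 0 or -1, the usual way to read a whole file with it is a while loop that tests for a value greater than 0. The sketch below assumes a helper called count_lines (a made-up name) that takes the file name as its argument.

```shell
# Hypothetical helper: count the lines of a file using getline in a loop.
count_lines() {
    awk -v file="$1" 'BEGIN {
        # getline returns 1 per line read, 0 at end of file, -1 on error.
        # Testing "> 0" keeps a missing file from looping forever.
        while ((getline line < file) > 0)
            n++
        close(file)                 # release the file once we are done
        print n + 0
    }'
}

tmp=$(mktemp)
printf 'one\ntwo\nthree\n' > "$tmp"
count_lines "$tmp"
rm -f "$tmp"
```

Writing the test as while (getline line < file) without the comparison is a common mistake: -1 counts as true, so an unreadable file would never end the loop.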
Function Definition Example
Here is an example of a user-defined function, called myprint, that takes a number and prints it in a specific format.
function myprint(num)
{
    printf "%6.3g\n", num
}
To illustrate, here is an awk rule which uses our myprint function:
$3 > 0 { myprint($3) }
This program prints, in our special format, all the third fields
that contain a positive number in our input. Therefore, when given:
1.2 3.4 5.6 7.8
9.10 11.12 -13.14 15.16
17.18 19.20 21.22 23.24
this program, using our function to format the results, prints:
   5.6
  21.2
Here is an example of a recursive function. It prints a string backwards:
function rev (str, len) {
    if (len == 0) {
        printf "\n"
        return
    }
    printf "%c", substr(str, len, 1)
    rev(str, len - 1)
}
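To try the recursive function, it has to be called with a string and its length. A minimal sketch, wrapped in a shell function whose name (reverse_lines) is invented here:

```shell
# Hypothetical wrapper: reverse every input line with the recursive rev function.
reverse_lines() {
    awk '
    function rev(str, len) {
        if (len == 0) {
            printf "\n"
            return
        }
        printf "%c", substr(str, len, 1)   # print the last character first
        rev(str, len - 1)                  # then the rest, one shorter
    }
    { rev($0, length($0)) }'
}

printf 'hello\nawk\n' | reverse_lines
```

This should print "olleh" and "kwa".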
Calling User-defined Functions
Calling a function means causing the function to run and do its job. A
function call is an expression, and its value is the value returned by
the function.
A function call consists of the function name followed by the arguments
in parentheses. What you write in the call for the arguments are awk
expressions; each time the call is executed, these expressions are
evaluated, and the values are the actual arguments. For example, here
is a call to foo with three arguments (the first being a string
concatenation):
foo(x y, "lose", 4 * z)
Caution: whitespace characters (spaces and tabs) are not allowed
between the function name and the open-parenthesis of the argument
list. If you write whitespace by mistake, awk might think that you mean
to concatenate a variable with an expression in parentheses. However,
it notices that you used a function name and not a variable name, and
reports an error.
When a function is called, it is given a copy of the values of its
arguments. This is called call by value. The caller may use a variable
as the expression for the argument, but the called function does not
know this: it only knows what value the argument had. For example, if
you write this code:
foo = "bar"
z = myfunc(foo)
then you should not think of the argument to myfunc as being "the
variable foo." Instead, think of the argument as the string value,
"bar".
If the function myfunc alters the values of its local variables, this
has no effect on any other variables. In particular, if myfunc does
this:
function myfunc (win) {
    print win
    win = "zzz"
    print win
}
to change its first argument variable win, this does not change
the value of foo in the caller. The role of foo in calling myfunc ended
when its value, "bar", was computed. If win also exists outside of
myfunc, the function body cannot alter this outer value, because it is
shadowed during the execution of myfunc and cannot be seen or changed
from there.
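The call-by-value behaviour is easy to check with a complete program. In this sketch myfunc is changed slightly to return its modified copy, and demo_by_value is just a made-up wrapper name.

```shell
# Hypothetical demo: a scalar argument is copied, so the caller's variable is untouched.
demo_by_value() {
    awk '
    function myfunc(win) {
        win = "zzz"          # changes only the local copy
        return win
    }
    BEGIN {
        foo = "bar"
        z = myfunc(foo)
        print "foo=" foo, "z=" z
    }'
}

demo_by_value
```

The output should be "foo=bar z=zzz": foo keeps its value even though win was changed inside the function.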
However, when arrays are the parameters to functions, they are
not copied. Instead, the array itself is made available for direct
manipulation by the function. This is usually called call by reference.
Changes made to an array parameter inside the body of a function are
visible outside that function. This can be very dangerous if you
do not watch what you are doing. For example:
function changeit (array, ind, nvalue) {
    array[ind] = nvalue
}
BEGIN {
    a[1] = 1 ; a[2] = 2 ; a[3] = 3
    changeit(a, 2, "two")
    printf "a[1] = %s, a[2] = %s, a[3] = %s\n", a[1], a[2], a[3]
}
prints 'a[1] = 1, a[2] = two, a[3] = 3', because calling changeit stores "two" in the second element of a.
The return Statement
The body of a user-defined function can contain a return statement.
This statement returns control to the rest of the awk program. It can
also be used to return a value for use in the rest of the awk program.
It looks like this:
return expression
The expression part is optional. If it is omitted, then the returned value is undefined and, therefore, unpredictable.
A return statement with no value expression is assumed at the end
of every function definition. So if control reaches the end of the
function body, then the function returns an unpredictable value.
awk will not warn you if you use the return value of such a function;
you will simply get unpredictable or unexpected results.
Here is an example of a user-defined function that returns a value for the largest number among the elements of an array:
function maxelt (vec,   i, ret) {
    for (i in vec) {
        if (ret == "" || vec[i] > ret)
            ret = vec[i]
    }
    return ret
}
You call maxelt with one argument, which is an array name. The local
variables i and ret are not intended to be arguments; while there is
nothing to stop you from passing two or three arguments to maxelt, the
results would be strange. The extra space before i in the function
parameter list is to indicate that i and ret are not supposed to be
arguments. This is a convention which you should follow when you define
functions.
Here is a program that uses our maxelt function. It loads an
array, calls maxelt, and then reports the maximum number in that array:
awk '
function maxelt (vec,   i, ret) {
    for (i in vec) {
        if (ret == "" || vec[i] > ret)
            ret = vec[i]
    }
    return ret
}
# Load all fields of each record into nums.
{
    for(i = 1; i <= NF; i++)
        nums[NR, i] = $i
}
END {
    print maxelt(nums)
}'
Given the following input:
1 5 23 8 16
44 3 5 2 8 26
256 291 1396 2962 100
-6 467 998 1101
99385 11 0 225
the program tells us that:
99385
is the largest number in our array.
awk Control Flow Statements
if ( expression ) statement1 else statement2
while ( expression ) statement
for ( expression1; expression; expression2 ) statement
The syntax of "if ... else" is:
if (<condition>) <action 1> [else <action 2>]
The "else" clause is optional. The "condition" can be any
expression discussed in the section on pattern matching, including
matches with regular expressions.
For example, consider the following Awk program:
{if ($1=="green") print "GO";
else if ($1=="yellow") print "SLOW DOWN";
else if ($1=="red") print "STOP";
else print "WHAT";}
The syntax for "while" is:
while (<condition>) <action>
The "action" is performed as long as the "condition" tests true, and
the "condition" is tested before each iteration. The conditions are the
same as for the "if ... else" construct. For example, since by default
an Awk variable has a value of 0, the following Awk program could print
the numbers from 1 to 20:
BEGIN {while(++x<=20) print x}
The "for" loop is more flexible. It has the syntax:
for (<initial action>;<condition>;<end-of-loop action>) <action>
For example, the following "for" loop prints the numbers 10 through 20 in increments of 2:
BEGIN {for (i=10; i<=20; i+=2) print i}
This is equivalent to:
i=10
while (i<=20) {
print i;
i+=2;}
The "for" loop has an alternate syntax, used when scanning through an array:
for (<variable> in <array>) <action>
with the example:
my_string = "joe:frank:harry:bill:bob:sil";
split(my_string, names, ":");
-- then the names could be printed with the following statement:
for (idx in names) print idx, names[idx];
This yields:
2 frank
3 harry
4 bill
5 bob
6 sil
1 joe
Notice that the names are not printed in the proper order. One of
the characteristics of this type of "for" loop is that the array is not
scanned in a predictable order.
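When the order matters, one way around this is to keep the count returned by split() and walk the indices with an ordinary "for" loop. A sketch, with a made-up function name (print_names_in_order):

```shell
# Hypothetical helper: print the pieces of a colon-separated list in order.
print_names_in_order() {
    awk -v list="$1" 'BEGIN {
        n = split(list, names, ":")      # split returns the number of pieces
        for (idx = 1; idx <= n; idx++)   # numeric loop: order is guaranteed
            print idx, names[idx]
    }'
}

print_names_in_order "joe:frank:harry:bill:bob:sil"
```

This prints "1 joe" first and "6 sil" last, whatever order "for (idx in names)" would have used.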
Awk defines four unconditional control statements: "break",
"continue", "next", and "exit". "Break" and "continue" are strictly
associated with the "while" and "for" loops:
• break: Causes a jump out of the loop.
• continue: Forces the next iteration of the loop.
"Next" and "exit" control Awk's input scanning:
• next: Causes Awk to immediately get another line of input and begin scanning it from the first pattern.
• exit: Causes Awk to stop reading its input and execute END operations, if any are specified.
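A small sketch of "next" and "exit" working together: comment lines are skipped, and reading stops at a line that says STOP. The function name sum_until_stop and the input format are assumptions made for the example.

```shell
# Hypothetical helper: add up the first field until a STOP line is seen.
sum_until_stop() {
    awk '
    /^#/         { next }          # skip comment lines, go to the next record
    $1 == "STOP" { exit }          # stop reading input; END still runs
                 { total += $1 }
    END          { print "total", total + 0 }'
}

printf '# header\n1\n2\nSTOP\n100\n' | sum_until_stop
```

This should print "total 3": the header is skipped by next, and the 100 after STOP is never read.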
Limits
Each implementation of awk imposes some limits. Below are limits typical of older implementations; current versions such as gawk are far less restrictive
100 fields
2500 characters per input line
2500 characters per output line
1024 characters per individual field
1024 characters per printf string
400 characters maximum quoted string
400 characters in character class
15 open files
1 pipe
EXAMPLES
There are many different ways to do things; here are a few examples.
The simplest action is to print some or all of a record; this is accomplished by the awk command print.
The awk program
awk '{ print }' myfile
prints each record, while
awk '{print $2, $1}' myfile
prints the first two fields in reverse order, separated by the output field separator. In contrast,
awk '{print $1 $2}' myfile
concatenates the two fields with nothing between them.
awk '{ print $1 >"foo1"; print $2 >"foo2" }' myfile
writes the first field of each line to the file foo1 and the second field to the file foo2.
The variables OFS and ORS may be used to change the current
output field separator and output record separator.
The output record separator is appended to the
output of the print statement.
Awk also provides the printf statement for output formatting.
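A quick sketch of OFS and ORS in action. Both function names (to_csv and join_first_fields) are invented for the illustration.

```shell
# Hypothetical helper: turn whitespace-separated fields into comma-separated ones.
to_csv() {
    awk 'BEGIN { OFS = "," }
         { $1 = $1; print }'       # assigning to a field rebuilds $0 using OFS
}

# Hypothetical helper: put the first field of every line on one line, ";" after each.
join_first_fields() {
    awk 'BEGIN { ORS = ";" }
         { print $1 }
         END { printf "\n" }'
}

printf 'a b c\n1 2 3\n' | to_csv
printf 'a b c\n1 2 3\n' | join_first_fields
```

The first command should print "a,b,c" and "1,2,3"; the second should print "a;1;". Note that OFS only shows up when print is given several arguments separated by commas, or when $0 has been rebuilt as above.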
BEGIN and END
The special pattern BEGIN matches the beginning of the input, before the
first record is read. The pattern END matches the end of the input, after
the last record has been processed. BEGIN and END thus provide a way to
gain control before and after processing.
-------
awk '{ if ($2 =="0.5") {print $0} }' myfile
prints the lines for which field 2 is 0.5 (this is a string comparison, so 0.50 would not match; use $2 == 0.5 to compare numerically)
Counting things:
awk 'BEGIN {counter = 0} {if ($2 == "0.5"){counter++}} END {print counter} ' myfile
this tells me how many times field 2 has the value 0.5
Using Awk to create a simple histogram
We have scores in a file called mydata
r 0.2 99
r 0.1 88
r 0.4 76
r 0.1 76
r 0.2 56
r 0.3 900
r 0.2 43
r 0.5 5
r 0.5 9
r 0.6 56
r 0.8 43
r 0.7 33
r 0.9 10
we can sort on the second column with:
sort -k2,2n mydata > mydata_sorted
this gives:
r 0.1 76
r 0.1 88
r 0.2 43
r 0.2 56
r 0.2 99
r 0.3 900
r 0.4 76
r 0.5 5
r 0.5 9
r 0.6 56
r 0.7 33
r 0.8 43
r 0.9 10
(sorting can be done in descending (reverse) order with sort -nr)
You can put the following lines in a file called histo.txt
to print a frequency histogram of the numbers in column (field) 2
$2 <= 0.1 {na=na+1}
($2 > 0.1) && ($2 <= 0.2) {nb = nb+1}
($2 > 0.2) && ($2 <= 0.3) {nc = nc+1}
($2 > 0.3) && ($2 <= 0.4) {nd = nd+1}
($2 > 0.4) && ($2 <= 0.5) {ne = ne+1}
($2 > 0.5) && ($2 <= 0.6) {nf = nf+1}
($2 > 0.6) && ($2 <= 0.7) {ng = ng+1}
($2 > 0.7) && ($2 <= 0.8) {nh = nh+1}
($2 > 0.8) && ($2 <= 0.9) {ni = ni+1}
($2 > 0.9) {nj = nj+1}
END {print na, nb, nc, nd, ne, nf, ng, nh, ni, nj, NR}
and run
awk -f histo.txt mydata_sorted
this will give:
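2 3 1 1 2 1 1 1 1  13
meaning that 0.1 occurs twice, 0.2 three times, and so on up to 0.9, which occurs once. The tenth count (values above 0.9) is empty because nj was never set, and the last number, 13, is NR, the total number of lines.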
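When the values themselves are the bins, as here, an array indexed by the value is a shorter alternative to one counter per range. This is a sketch rather than a replacement for histo.txt: it counts each distinct value of field 2, and the name value_counts is made up.

```shell
# Hypothetical helper: count how often each distinct value occurs in field 2.
value_counts() {
    awk '{ count[$2]++ }
         END { for (v in count) print v, count[v] }' "$@" | LC_ALL=C sort -n
}

printf 'r 0.2 99\nr 0.1 88\nr 0.2 56\n' | value_counts
```

For these three lines it should print "0.1 1" and "0.2 2". The output goes through sort because "for (v in count)" does not visit the array in a predictable order.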
-------
PDB file
Count the residues in a PDB file (one CA atom per residue)
awk 'BEGIN{
    counter = 0
}
{
    if ($3 == "CA"){
        counter++
    }
}
END{
    print counter
}' mypdb.pdb
To select lines with ATOM and CA and get the amino acid name
awk '$1 == "ATOM" && $3 == "CA" {print $4}' mypdb.pdb
or
awk '$1 == "ATOM" && $1 != "HETATM"' my.pdb | grep CA | awk '{print $4}' > myoutput.pdb
Get the sequence from a PDB file (warning: there might be some non-standard amino acids, which are left as three-letter names)
awk '$1 == "ATOM" && $3 == "CA" {print $4}' mypdb.pdb | awk '{
    gsub(/VAL/, "V"); gsub(/GLY/, "G"); gsub(/ALA/, "A"); gsub(/LEU/, "L")
    gsub(/ILE/, "I"); gsub(/SER/, "S"); gsub(/THR/, "T"); gsub(/ASP/, "D")
    gsub(/ASN/, "N"); gsub(/LYS/, "K"); gsub(/GLU/, "E"); gsub(/GLN/, "Q")
    gsub(/ARG/, "R"); gsub(/HIS/, "H"); gsub(/PHE/, "F"); gsub(/CYS/, "C")
    gsub(/TRP/, "W"); gsub(/TYR/, "Y"); gsub(/MET/, "M"); gsub(/PRO/, "P")
    residues = residues $1
}
END { print residues }' > myoutput.seq
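The twenty gsub calls can also be written as a lookup table held in an array. The sketch below is one possible way to do it, under the assumption that the fields of the PDB file are separated by whitespace (real PDB files are fixed-column, so fields can run together). The name pdb_sequence is made up, and unknown residue names are shown as X.

```shell
# Hypothetical helper: print the one-letter sequence of the CA atoms in a PDB file.
pdb_sequence() {
    awk 'BEGIN {
        # three-letter and one-letter codes of the 20 standard residues
        codes = "ALA A ARG R ASN N ASP D CYS C GLN Q GLU E GLY G HIS H ILE I"
        codes = codes " LEU L LYS K MET M PHE F PRO P SER S THR T TRP W TYR Y VAL V"
        n = split(codes, pair, " ")
        for (i = 1; i < n; i += 2)
            one[pair[i]] = pair[i + 1]
    }
    $1 == "ATOM" && $3 == "CA" {
        aa = ($4 in one) ? one[$4] : "X"   # X marks a residue name not in the table
        seq = seq aa
    }
    END { print seq }' "$@"
}

printf 'ATOM 1 N MET A 1\nATOM 2 CA MET A 1\nATOM 3 CA GLY A 2\n' | pdb_sequence
```

For the three sample lines this should print "MG". It would be called on a file as pdb_sequence mypdb.pdb > myoutput.seq.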
Distance between 2 atoms in a PDB file - test1 - I try something simple
if we have the x, y, z coordinates of 2 atoms in a file
x y z
50.211 14.979 24.196 my atom x1
50.142 15.162 25.415 my atom x2
The distance between these 2 atoms is:
sqrt((x1-x2)^2 + (y1-y2)^2 + (z1-z2)^2)
awk '{printf "%s\t", $0}' mycoordinates > test1
to make 1 row of many columns
I get this in my file test1 (the first two rows only label the fields):
$1      $2      $3      $4      $5      $6
x1      y1      z1      x2      y2      z2
50.211  14.979  24.196  50.142  15.162  25.415
awk '{ a=sqrt(($1-$4)^2 + ($2-$5)^2 + ($3-$6)^2); print a}' test1
result is 1.23459
Distance between 2 atoms in a PDB file - test2
Now, if I have in my file test2:
50.211 14.979 24.196 my atom x1
50.142 15.162 25.415 my atom x2
I can, for example, do it for field 1 only, that is, for the values
50.211
50.142
awk 'NR ==1 {split($0,a," ")}; NR == 2 {split($0,b," ")} END {print sqrt((a[1]-b[1])^2)}' test2
for the complete file test2:
awk 'NR ==1 {split($0,a," ")}; NR == 2 {split($0,b," ")} END {print sqrt((a[1]-b[1])^2 + (a[2]-b[2])^2 + (a[3]-b[3])^2)}' test2
the result is: 1.23459
with:
awk 'NR ==1 {split($0,a," ")}; NR == 2 {split($0,b," ")} END {printf "%.3f\n", sqrt((a[1]-b[1])^2 + (a[2]-b[2])^2 + (a[3]-b[3])^2)}' test2
the result is: 1.235
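The same distance can be computed without split(), by remembering the fields of line 1 in ordinary variables and doing the arithmetic when line 2 arrives. A sketch, with a made-up function name (atom_distance):

```shell
# Hypothetical helper: distance between the points on line 1 and line 2 of the input.
atom_distance() {
    awk 'NR == 1 { x = $1; y = $2; z = $3; next }   # remember the first atom
         NR == 2 { printf "%.3f\n", sqrt(($1 - x)^2 + ($2 - y)^2 + ($3 - z)^2) }'
}

printf '50.211 14.979 24.196\n50.142 15.162 25.415\n' | atom_distance
```

For the two atoms above this prints 1.235, as before.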
-------
Using getline
If, in a file called molecule.txt, I have:
ID 44000
jkjk
jkk
END
ID 400009
jkkk
mvnvbc
END
ID 58939
jkd
jkjd
END
ID 400009
jj
END
Thus four molecules, ending with END and with an empty line at the end of the file
and a file called molecule_ID_numbers.txt with 4 lines and 4 numbers:
7888
9000
10000
15000
I can use getline var < file
This form of the getline function takes its input from the
file molecule_ID_numbers.txt and puts it in the variable var.
The following program copies molecule.txt line by line. Each time it
encounters a line whose first field is END, it reads the next number
from molecule_ID_numbers.txt and prints an empty line, <CAS NUMBER>,
the number and another empty line after the END line. The string "END"
must be quoted: an unquoted name is an uninitialized variable, which
compares equal to an empty field. molecule_ID_numbers.txt should
contain as many numbers as there are molecules in molecule.txt.
awk '{ print } $1 == "END" { if ((getline var < "molecule_ID_numbers.txt") > 0) print "\n<CAS NUMBER>\n" var "\n" }' molecule.txt > myout
The output is in myout:
ID 44000
jkjk
jkk
END
<CAS NUMBER>
7888
ID 400009
jkkk
mvnvbc
END
<CAS NUMBER>
9000
ID 58939
jkd
jkjd
END
<CAS NUMBER>
10000
ID 400009
jj
END
<CAS NUMBER>
15000
-------
COUNTING STUFF
awk 'BEGIN{
counter = 0
}
{
if ($3 >= 0 && $3 <= 1 ){
counter++
}
}
END{
print "Scores_0_to_1 " counter
}' mylistwithnameshort_test.txt
awk 'BEGIN{
counter = 0
}
{
if ($3 >= 1 && $3 <= 2 ){
counter++
}
}
END{
print "Scores_1_to_2 " counter
}' mylistwithnameshort_test.txt
awk 'BEGIN{
counter = 0
}
{
if ($3 >= 2 && $3 <= 3 ){
counter++
}
}
END{
print "Scores_2_to_3 " counter
}' mylistwithnameshort_test.txt
awk 'BEGIN{
counter = 0
}
{
if ($3 >= 3 && $3 <= 4 ){
counter++
}
}
END{
print "Scores_3_to_4 " counter
}' mylistwithnameshort_test.txt
awk 'BEGIN{
counter = 0
}
{
if ($3 >= 4 && $3 <= 5 ){
counter++
}
}
END{
print "Scores_4_to_5 " counter
}' mylistwithnameshort_test.txt
awk 'BEGIN{
counter = 0
}
{
if ($3 >= 5 && $3 <= 6 ){
counter++
}
}
END{
print "Scores_5_to_6 " counter
}' mylistwithnameshort_test.txt
awk 'BEGIN{
counter = 0
}
{
if ($3 >= 6 && $3 <= 7 ){
counter++
}
}
END{
print "Scores_6_to_7 " counter
}' mylistwithnameshort_test.txt
awk 'BEGIN{
counter = 0
}
{
if ($3 >= 7 && $3 <= 8 ){
counter++
}
}
END{
print "Scores_7_to_8 " counter
}' mylistwithnameshort_test.txt
awk 'BEGIN{
counter = 0
}
{
if ($3 >= 8 && $3 <= 9 ){
counter++
}
}
END{
print "Scores_8_to_9 " counter
}' mylistwithnameshort_test.txt
awk 'BEGIN{
counter = 0
}
{
if ($3 >= 9 && $3 <= 10 ){
counter++
}
}
END{
print "Scores_9_to_10 " counter
}' mylistwithnameshort_test.txt
awk 'BEGIN{
counter = 0
}
{
if ($3 >= 10 && $3 <= 11 ){
counter++
}
}
END{
print "Scores_10_to_11 " counter
}' mylistwithnameshort_test.txt
awk 'BEGIN{
counter = 0
}
{
if ($3 >= 11 && $3 <= 12 ){
counter++
}
}
END{
print "Scores_11_to_12 " counter
}' mylistwithnameshort_test.txt
awk 'BEGIN{
counter = 0
}
{
if ($3 >= 12 && $3 <= 13 ){
counter++
}
}
END{
print "Scores_12_to_13 " counter
}' mylistwithnameshort_test.txt
awk 'BEGIN{
counter = 0
}
{
if ($3 >= 13 && $3 <= 14 ){
counter++
}
}
END{
print "Scores_13_to_14 " counter
}' mylistwithnameshort_test.txt
awk 'BEGIN{
counter = 0
}
{
if ($3 >= 14 && $3 <= 15 ){
counter++
}
}
END{
print "Scores_14_to_15 " counter
}' mylistwithnameshort_test.txt
awk 'BEGIN{
counter = 0
}
{
if ($3 >= 15 && $3 <= 20 ){
counter++
}
}
END{
print "Scores_15_to_20 " counter
}' mylistwithnameshort_test.txt
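The sixteen runs above can be folded into a single pass over the file by keeping the counters in an array. This is only a sketch with one deliberate difference: the ranges are taken as half-open (0 up to but not including 1, and so on), so a score of exactly 1 is counted once, whereas the separate scripts above count it in both Scores_0_to_1 and Scores_1_to_2. The function name count_scores is made up.

```shell
# Hypothetical helper: count the scores in field 3, one bin per unit from 0 to 15,
# plus a last bin from 15 to 20.
count_scores() {
    awk '
    $3 >= 0  && $3 < 15  { bin[int($3)]++ }   # bins 0-1, 1-2, ... 14-15
    $3 >= 15 && $3 <= 20 { last++ }           # the wider final bin
    END {
        for (k = 0; k < 15; k++)
            print "Scores_" k "_to_" (k + 1) " " (bin[k] + 0)
        print "Scores_15_to_20 " (last + 0)
    }' "$@"
}

printf 'a b 0.5\na b 1\na b 17\n' | count_scores
```

It would be called as count_scores mylistwithnameshort_test.txt and prints the same sixteen labels, one per line.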
-------
echo "the script starts"
echo "check that each compound starts with ISIS"
echo -n "select the SDF file:"
read file1
echo "the name of the file is $file1"
tr '\r' '\n' < "$file1" > "tmp1_unix.sdf"
echo "step 1"
awk 'NF > 0' < "tmp1_unix.sdf" > "tmp2_unix_no_emptylines.sdf"
echo "this is done"
-------
Rename files
#!/bin/sh
# we have less than 3 arguments. Print the help text:
if [ $# -lt 3 ] ; then
    cat <<HELP
ren -- renames a number of files using sed regular expressions
USAGE: ren 'regexp' 'replacement' files...
EXAMPLE: rename all *.HTM files in *.html:
ren 'HTM$' 'html' *.HTM
HELP
    exit 0
fi
OLD="$1"
NEW="$2"
# The shift command removes one argument from the list of
# command line arguments.
shift
shift
# "$@" contains now all the files (quoted, so names with spaces survive):
for file in "$@"; do
    if [ -f "$file" ] ; then
        newfile=`echo "$file" | sed "s/${OLD}/${NEW}/g"`
        if [ -f "$newfile" ]; then
            echo "ERROR: $newfile exists already"
        else
            echo "renaming $file to $newfile ..."
            mv "$file" "$newfile"
        fi
    fi
done
Rename files (add an extension)
# this only prints the mv commands; pipe the output to sh to run them
for i in toto* ; do
    echo "mv $i $i.smi"
done
-------
Some other examples
# Print first two fields in opposite order:
awk '{ print $2, $1 }' file
# Print lines longer than 72 characters:
awk 'length > 72' file
# Print length of string in 2nd column
awk '{print length($2)}' file
# Add up first column, print sum and average:
{ s += $1 }
END { print "sum is", s, " average is", s/NR }
# Print fields in reverse order:
awk '{ for (i = NF; i > 0; --i) print $i }' file
# Print the last line
{line = $0}
END {print line}
# Print the total number of lines that contain the word Pat
/Pat/ {nlines = nlines + 1}
END {print nlines}
# Print all lines between start/stop pairs:
awk '/start/, /stop/' file
# Print all lines whose first field is different from previous one:
awk '$1 != prev { print; prev = $1 }' file
# Print column 3 if column 1 > column 2:
awk '$1 > $2 {print $3}' file
# Print line if column 3 > column 2:
awk '$3 > $2' file
# Count number of lines where col 3 > col 1
awk '$3 > $1 {i++} END {print i + 0}' file
# Print sequence number and then column 1 of file:
awk '{print NR, $1}' file
# Print every line after erasing the 2nd field
awk '{$2 = ""; print}' file
# Print hi 28 times
yes | head -28 | awk '{ print "hi" }'
# Print hi.0010 to hi.0099 (NOTE IRAF USERS!)
yes | head -90 | awk '{printf("hi.%04d\n", NR+9)}'
# Print out 4 random numbers between 0 and 1
yes | head -4 | awk '{print rand()}'
# Print out 40 random integers modulo 5
yes | head -40 | awk '{print int(100*rand()) % 5}'
# Replace every field by its absolute value
{ for (i = 1; i <= NF; i=i+1) if ($i < 0) $i = -$i; print }
# If you have another character that delimits fields, use the -F option
# For example, to print out the phone number for Jones in the following file,
# 000902|Beavis|Theodore|333-242-2222|149092
# 000901|Jones|Bill|532-382-0342|234023
# ...
# type
awk -F"|" '$2=="Jones"{print $4}' filename
# Some looping commands
# Remove a bunch of print jobs from the queue
BEGIN{
    for (i=875;i>833;i--){
        printf "lprm -Plw %d\n", i
    }
    exit
}
Formatted printouts are of the form printf( "format\n", value1, value2,
... valueN)
e.g. printf("howdy %-8s What it is bro. %.2f\n", $1, $2*$3)
%s = string
%-8s = 8 character string left justified
%.2f = number with 2 places after .
%6.2f = number in a field 6 characters wide with 2 digits after .
\n is newline
\t is a tab
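The format codes listed above can be seen side by side in one printf call. A small sketch; format_demo is a made-up name and the brackets are only there to make the padding visible.

```shell
# Hypothetical demo: the same values printed with different format codes.
format_demo() {
    awk 'BEGIN {
        # %s as is, %-8s left justified in 8 characters,
        # %.2f two decimals, %6.2f two decimals in a field 6 wide
        printf("[%s][%-8s][%.2f][%6.2f]\n", "joe", "joe", 3.14159, 3.14159)
        printf("a\tb\n")                # \t puts a tab between a and b
    }'
}

format_demo
```

The first line of output should be "[joe][joe     ][3.14][  3.14]".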