Unix tips

Some Unix tips to manipulate biology and chemistry files

The idea is to have very short and simple scripts or one-line commands
to perform simple tasks on chemistry and protein (or other) files

Preparing files
Line endings differ across OSs (e.g. the invisible Control-M, ^M, characters left by Windows files... etc.)

seen with:

cat -v filename

cat -t filename
(both make non-printing characters visible; -t also shows tabs as ^I)
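
A quick test (dosfile.txt is just an illustrative name):

printf 'line1\r\nline2\r\n' > dosfile.txt
cat -v dosfile.txt

shows the carriage returns as ^M at the end of each line:
line1^M
line2^M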


tr '\r' '\n' < macfile.txt > unixfile.txt
or
awk '{ gsub("\r", "\n"); print $0;}' macfile.txt > unixfile.txt

To convert a Unix file to Mac OS using awk, at the command line, enter:
awk '{ gsub("\n", "\r"); print $0;}' unixfile.txt > macfile.txt
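
For Windows/DOS files (CR+LF line endings), deleting the carriage returns is usually what is wanted; a minimal sketch:

tr -d '\r' < dosfile.txt > unixfile.txt

(this simply drops every \r, so no double blank lines are produced)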

Delete BOTH leading and trailing whitespace from each line
sed 's/^[ \t]*//;s/[ \t]*$//' inputfile
(some seds do not understand \t; in that case type a literal tab instead, e.g. Ctrl-V then Tab)

Remove Control-M in place in all XML files (warning: this changes the files; duplicate them first if the originals are needed):
perl -pi -e 's/\r/\n/g' *.xml     (may add double blank lines on CR+LF files; 's/\r//g' simply deletes the CR instead)


Simple sed for html
sed -e 's/target="_blank">http.*</target="_blank">LINK<\/a></g' my.html > my2.html
This runs with sed on a Mac; beware of behavior differences between OSs

Replace, in a CSV file for instance, a ";" by a tab. Example with a FAF-Drugs filtering results.csv file
awk  '{gsub(";","\t",$0); print;}' myfile.csv

Sort things out
If in the file we have compound names and energy scores
Toto,12
Toto2,17
Toto3,5

To sort after the "," in reverse order
sort -t',' -k2 -r -n
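
With the three example lines above in a file, say myscores.csv (an assumed name), this gives:

sort -t',' -k2 -r -n myscores.csv
Toto2,17
Toto,12
Toto3,5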

Check compound ID in field 1 already present in the file
With, in file1, AA_20 and AA_21 being the same compound but different poses (the goal is to get the unique compound IDs):
AA_20
BB_30
AA_21
CC_300
AB_30

awk -F'_' '!seen[$1]++' file1
will print this:
AA_20
BB_30
CC_300
AB_30

Note:
Here awk uses an associative array to remove duplicates. When a key ($1) appears for the first time, its count seen[$1] is post-incremented, so the expression evaluates to 0; the negation of 0 is true, so the line is printed. When the same key appears again, the count is already 1 or more, the negation is false, and the line is not printed.
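
To see the counting at work, one can also print how many poses each compound has (the output order of a "for (i in ...)" loop is arbitrary):

awk -F'_' '{count[$1]++} END {for (c in count) print c, count[c]}' file1
AA 2
BB 1
CC 1
AB 1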

Remove duplicate compounds with Babel
babel  infile.xxx  outfile.yyy  --unique cansmiNS  
babel  infile.xxx  outfile.yyy  --unique cansmi    
babel  infile.xxx  outfile.yyy  --unique /nostereo/nochg   


Remove the first line of a file
awk 'NR>1' myfile.txt > output.txt

Sort on the 3rd field in reverse order
sort -r -k 3,3 input >  output

Keep unique lines without sorting in field 2
awk '!x[$2]++' input > output

Keep the top 500 lines
head -500 infile > output
or
awk 'FNR <= 500' infile > output

Find the difference between two sorted files
diff file1 file2 | awk '/^>/{print $2 }'

MEAN, MEDIAN, MODE ...some codes from the forums....
Simple statistics - for much more, the R package is one solution
The mode of a list of numbers is the value (or values) that occurs most frequently
Median  is the 'middle value' in your list

To round to the nearest integer using awk:
echo "767.992" | awk '{printf("%5.0f\n",$0)}'

If needed
awk prints only the first occurrence of each record, thus removing duplicates from the data file without sorting
awk '!x[$0]++' myfile | sed '/^\s*$/d'
sed '/^\s*$/d'   to remove single blank lines after duplicates are removed
or other ways to remove blank line (maybe better than sed)
awk '!/^$/' myfile
or
awk 'NF > 0' filename
or
awk NF filename

if the data are separated with a "|"
awk -F\| '{printf("%5.0f\n",$1)}'

MEAN (for field 2)
cat myfile_with_pipe_separators | awk -F\| '{print $2}' | awk '{sum+=$1} END {print sum/NR}'

MEAN and MEDIAN (here also, data separated by a "|" pipe and done for field 2, thus $2)
cat myfile | awk -F\| '{print $2}' | awk '{sum+=$1;a[x++]=$1;b[$1]++} END {print "Mean: " sum/x "\nMedian: "a[int((x+1)/2)] }'
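
Note that the median above takes the middle element in file order, so it is only the true median if the values are already sorted; a sketch that sorts first and also handles an even count:

awk -F\| '{print $2}' myfile | sort -n | awk '{a[NR]=$1} END {print ((NR%2) ? a[(NR+1)/2] : (a[NR/2]+a[NR/2+1])/2)}'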

Standard Deviation (field 2...)
cat myfile | awk -F\| '{print $2}' | awk '{sum+=$1; sumsq+=$1*$1;} END {print "stdev = " sqrt(sumsq/NR - (sum/NR)**2);}'
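
The formula above is the population standard deviation (dividing by NR); for the sample standard deviation (dividing by NR-1), a sketch (assumes at least 2 values):

cat myfile | awk -F\| '{print $2}' | awk '{sum+=$1; sumsq+=$1*$1} END {print "sample stdev = " sqrt((sumsq - sum*sum/NR)/(NR-1))}'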

MODE (warning: there may be no single most frequent value; example for field 2 with many distinct numbers, round them up before the process starts)
This could work for the entire file; then $0 should be used instead of $2 and the round-up step removed... depending on your data
cat myfile | awk -F\| '{printf("%5.0f\n",$2)}' | awk '{++a[$1]}END{for(i in a)if(a[i]>max){max=a[i];k=i} print k}' > tmp_file
If nothing is found, the output file is empty

Test if empty file, something like this ..many ways

awk NF try_empty.txt | awk 'END{print(NR>0)?"NOT-EMPTY":"EMPTY"}'
if the file try_empty contains carriage returns only, the output will be EMPTY, as it should (note the test is NR>0: one non-blank line already means NOT-EMPTY)
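
A plain shell alternative (note: [ -s ] only tests for a non-zero size, so a file containing just blank lines counts as NOT-EMPTY, unlike the awk test above):

[ -s try_empty.txt ] && echo "NOT-EMPTY" || echo "EMPTY"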

MODE
If I want the mode of the integer parts only (6.5, 8.5.... counted as 6, 8....)
awk -F\. '{if (max <= ++x[$1]) max = x[$1]} END { for (i in x) if (x[i] == max) print i }' mynumber.txt

MIN, MAX
seq 2 10 | awk 'NR==1 {min=$1} NR>1 && $1<min { min=$1 } END {print min}'
seq 2 10 | awk 'NR==1 {max=$1} NR>1 && $1>max { max=$1 } END {print max}'

seq 2 10 | awk '{if(min==""){min=max=$1}; if($1>max) {max=$1}; if($1<min) {min=$1}; total+=$1; count+=1} END {print total/count, max, min}'

seq 10 | awk '{sum+=$1} END {print sum/NR}'
seq 10 | awk '{sum+=$1} END {print sum}'

MEAN, MEDIAN... again
in my file numbers.txt I have:
2.5
3.4
4
5.5
5.6
5
6
7

In a file named awk_mean_median_mode.txt I have:
(NB: if the file has headers on the first line, add NR > 1 in front of {col=$1}; also, asort is a gawk extension, so use gawk if plain awk complains)

{col=$1}{if((col ~  /^-?[0-9]*([.][0-9]+)?$/) && ($0!=""))                 
{
     sum+=col;
     a[x++]=col;
     b[col]++
     if(b[col]>hf){hf=b[col]}
}
}
END{n = asort(a);idx=int((x+1)/2)
     print "Mean: " sum/x
     print "Median: " ((idx==(x+1)/2) ? a[idx] : (a[idx]+a[idx+1])/2)
     for (i in b){if(b[i]==hf){(k=="") ? (k=i):(k=k FS i)}{FS=","}}
     print "Mode: " k
}


If I run:
awk -F'.' -f awk_mean_median_mode.txt numbers.txt
I have:
Mean: 4.28571
Median: 5
Mode: 5

If the number file contains:
2.5
3.4
4
5
6
7
 
 
Then I would get:
Mean: 4
Median: 4
Mode: 2,3,4,5,6

There is no single most frequent number, thus mode prints all of them

 

--------------------------------

Split files
Warning: check that the line endings are Linux/Unix-type

Split a file when finding a pattern, eg, pattern START
awk '/START/ { i++ } { print > ("temp_txt" i) }' myfile
will create the output files temp_txt1, temp_txt2...
(the parentheses around the file-name expression are required by some awks)


split -l 200000 mybigfile.smi output_file
generates files of 200k lines each (-l sets the number of lines per file)
    output_fileaa
    output_fileab
    output_fileac
    output_filead...


Split a large file into many files of "about" 5 lines each
awk -v c=1 'NR%5==0{++c}{print $0 > (c".txt")}' Datafile.txt
AND THEN you can rename:
for filename in *.txt; do mv "$filename" "Prefix_$filename"; done;

If there are too many files in the directory and rm fails with the annoying warning: argument list too long,
then one can try to
find all the .smi files and pipe them to rm
find . -name "*.smi" | xargs rm
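
If some file names may contain spaces, a safer sketch uses the null-separated variants, or find's -delete:

find . -name "*.smi" -print0 | xargs -0 rm
or
find . -name "*.smi" -delete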

Split file
awk 'NR%2==0' FILE > ALL_THE_EVEN_LINES
awk 'NR%2==1' FILE > ALL_THE_ODD_LINES

 

--------------------------------

 

Search for duplicated IDs in one or two files
Maybe one first needs to convert to lower case:

awk '{print tolower($0)}' filename

awk 'x[$1]++ == 1 { print $1 " is duplicated or present many times"}' myfile
find duplicate names or IDs in field 1

If we have:

File1: mydrug.txt
aspirin
acetylene
amphetamine

File2: molecule_wiki.txt
ammonia    N
aspirin    O=C(Oc1ccccc1C(=O)O)C
acetylene    C#C

 awk 'NR==FNR{a[$1]=$2;next} !($1 in a)' molecule_wiki.txt mydrug.txt
 prints: amphetamine
 i.e., the molecule name present in the second file (mydrug.txt) that is not found in the first file (molecule_wiki.txt)

Delete the last line of a file on a Mac (this edits the file in place)
sed -i '' -e '$ d' file1

Print specific field if a score is above a given value
For instance:
awk '{if ($9 > 5.0) print $9, "\t", $11, "\t", $13, "\t", $14, "\t", $15, "\t", $16}' myfile
If field 9, which contains the score values, is above 5 in energy value, print some other fields separated by tabs

Do not print some lines
grep -v ' -1000.00 '
removes the lines containing -1000.00 surrounded by spaces

Print the first 10 lines of a file (emulates "head -10")
awk 'NR < 11'
or, changing the default FS and keeping field 1 of the first 2000 lines:
awk 'BEGIN {FS=","} NR < 2001 {print $1}'

Some simple examples to start
PDB file: count the residues in a PDB file (of course there may be several chains, and this can be checked)
awk 'BEGIN {counter = 0} {if ($3 == "CA") {counter++}} END {print counter}' myfile.pdb

To select the lines with ATOM in field 1 and the C-alphas (field 3), and get the amino-acid names (field 4)
awk '$1 == "ATOM" && $3 == "CA" {print $4}' mypdb.pdb

or

awk '$1!="HETATM"' myfile.pdb | grep CA | awk '{print $4}' > myoutput.pdb
(to match exactly the pattern HETATM)
or
cat myPDBfile.pdb | awk '$1!="HETATM"' | grep CA | awk '{print $4}'
(Note: the != negates the match)


Get the sequence from a PDB file (warning: does not check for non-standard amino acids or for different chains)
cat mypdb.pdb | awk '$1 == "ATOM" && $3 == "CA" {print $4}' | awk ' { gsub( /VAL/, "V"); gsub( /GLY/, "G"); gsub( /ALA/, "A"); gsub( /LEU/, "L"); gsub( /ILE/, "I"); gsub( /SER/, "S"); gsub( /THR/, "T"); gsub( /ASP/, "D"); gsub( /ASN/, "N"); gsub( /LYS/, "K"); gsub( /GLU/, "E"); gsub( /GLN/, "Q"); gsub( /ARG/, "R"); gsub( /HIS/, "H"); gsub( /PHE/, "F"); gsub( /CYS/, "C"); gsub( /TRP/, "W"); gsub( /TYR/, "Y"); gsub( /MET/, "M"); gsub( /PRO/, "P"); residues = residues $1} END {print residues }'
or
awk '$1 == "ATOM" && $3 == "CA" {print $4}' MYPDB_file.pdb | awk ' { gsub( /VAL/, "V"); gsub( /GLY/, "G"); gsub( /ALA/, "A"); gsub( /LEU/, "L"); gsub( /ILE/, "I"); gsub( /SER/, "S"); gsub( /THR/, "T"); gsub( /ASP/, "D"); gsub( /ASN/, "N"); gsub( /LYS/, "K"); gsub( /GLU/, "E"); gsub( /GLN/, "Q"); gsub( /ARG/, "R"); gsub( /HIS/, "H"); gsub( /PHE/, "F"); gsub( /CYS/, "C"); gsub( /TRP/, "W"); gsub( /TYR/, "Y"); gsub( /MET/, "M"); gsub( /PRO/, "P"); residues = residues $1} END {print residues }' > myoutput.seq

Distance between 2 atoms in a file - simple test1
if I have only the x, y, z coordinates of 2 atoms in the file: coordinates

    x         y            z
------------------    
50.211 14.979 24.196   (note: this is my atom x1)
50.142 15.162 25.415   (note: this is my atom x2)

The distance between these 2 atoms is:
sqrt((x1-x2)^2 + (y1-y2)^2 + (z1-z2)^2)

awk '{printf "%s\t", $0}' coordinates > my_2_atoms_in_a_row.txt
to put the two lines on a single row

I get this in my file my_2_atoms_in_a_row.txt:

   $1          $2          $3          $4        $5         $6         
   x1          y1         z1           x2        y2          z2
-----------------------------------------
50.211   14.979   24.196    50.142   15.162   25.415

awk '{ a=sqrt(($1-$4)^2 + ($2-$5)^2 + ($3-$6)^2); print a}' my_2_atoms_in_a_row.txt
result is 1.23459 (here in angstroms)

Distance between 2 atoms in a file - test2
Now, if I have in my file coordinates:
50.211 14.979 24.196   (my atom x1)
50.142 15.162 25.415   (my atom x2)

Then
awk 'NR ==1 {split($0,a," ")}; NR == 2 {split($0,b," ")} END {print sqrt((a[1]-b[1])^2 + (a[2]-b[2])^2 + (a[3]-b[3])^2)}' coordinates
Result = 1.23459
or:
awk 'NR ==1 {split($0,a," ")}; NR == 2 {split($0,b," ")} END {printf "%.3f\n", sqrt((a[1]-b[1])^2 + (a[2]-b[2])^2 + (a[3]-b[3])^2)}' coordinates
Result = 1.235


A simple example to combine data from two files
Using getline with AWK, a file with the coordinates of 4 molecules, and another file with the compound IDs in the same order as these 4 molecules,
I want to add the compound IDs to the coordinate file

if in a file called molecule.txt I have:
ID 44000
jkjk_coordinates...
jkk_coordinates...
END

ID 400009
jkkk_coordinates...
mvnvbc_coordinates...
END

ID 58939
jkd_coordinates...
jkjd_coordinates...
END

ID 400009
jj_coordinates...
KKL_coordinates...
END

Thus four molecules, each ending with END, and with an empty line at the end of the file (important to have the empty line here)

and a file called molecule_ID_numbers.txt with 4 lines and 4 ID numbers in the same order as the molecular coordinates in the file molecule.txt:
7888
9000
10000
15000

I can use getline var < file with AWK
This form of the getline function takes its input from the file molecule_ID_numbers.txt and puts it in the variable var


The following script reads the file molecule.txt and, on the empty lines between molecules, inserts an empty line, <CAS NUMBER>, and the next ID value read from molecule_ID_numbers.txt (note: the awk variable ID is never set, so it is the empty string, and the test $1 == ID is therefore true exactly on the blank lines)

awk '{ if ($1 == ID) {getline var < "molecule_ID_numbers.txt" ; print "\n", "\<CAS NUMBER\>" "\n", var, "\n\n"} else print}' < "molecule.txt" > myoutfile


The output is :
ID 44000
jkjk
jkk
END

 <CAS NUMBER>
 7888


ID 400009
jkkk
mvnvbc
END

 <CAS NUMBER>
 9000


ID 58939
jkd
jkjd
END

 <CAS NUMBER>
 10000


ID 400009
jj
KKL
END

 <CAS NUMBER>
 15000

--------------

Merge two files after a matching pattern
If I have two files:
file1_smiles_score.txt
Smiles  name    score
CCC  drug1  12
CCCCC  drug2     10
CCCCCCC  drug3     8

file2_info_drug.txt
disease name    vendor  cost
type1   drug1  vendor1  high
type2   drug2   vendor2  low
type3   drug3   vendor3 low
type4   drug4   vendor4 unknown

If I run:
awk '{print FILENAME, NR, FNR, $0}' file2_info_drug.txt file1_smiles_score.txt

file2_info_drug.txt 1 1 disease name    vendor    cost
file2_info_drug.txt 2 2 type1    drug1  vendor1    high
file2_info_drug.txt 3 3 type2    drug2    vendor2  low
file2_info_drug.txt 4 4 type3    drug3    vendor3    low
file2_info_drug.txt 5 5 type4    drug4   vendor4    unknown
file2_info_drug.txt 6 6
file1_smiles_score.txt 7 1 Smiles    name    score
file1_smiles_score.txt 8 2 CCC  drug1  12
file1_smiles_score.txt 9 3 CCCCC  drug2     10
file1_smiles_score.txt 10 4 CCCCCCC  drug3     8
file1_smiles_score.txt 11 5

NR : the total number of records processed so far, across all input files
FNR : the record number within the current input file

Syntax:
In awk assigning array elements:
arrayname[string]=value

    arrayname is the name of the array
    string is the index of an array
    value is the value you are assigning to that element of the array

and the "for loop" is something like:
for (var in myarrayname)
actions

Examples:
echo 1 2 3 4 | awk '{my_arrayname[$1] = $3};END {for(i in my_arrayname) print my_arrayname[i]}'
3

echo 1 2 3 4 | awk '{my_arrayname[$1] = $4};END {for(i in my_arrayname) print my_arrayname[i]}'
4

"i" is the index
A "for loop" is needed to iterate and print content of an array

Now, if I need to merge file1 and file2 by the drug names present in field 2 of each file,
and add the 3rd column of file2 to file1,
Something like this:
Smiles  name    score     vendor
CCC  drug1      12       vendor1


awk 'NR==FNR {myarray[$2] = $3; next} {print $1,$2,$3,myarray[$2]}' file2_info_drug.txt file1_smiles_score.txt
Smiles name score vendor
CCC drug1 12 vendor1
CCCCC drug2 10 vendor2
CCCCCCC drug3 8 vendor3


Notes found in awk manuals and forums:
awk 'NR==FNR {myarray[$2] = $3; next} {print $1,$2,$3,myarray[$2]}' file2.txt file1.txt

First read file2 (NR==FNR is only true for the first file argument; FNR is the record number, typically the line number, in the current file; NR is the total record number, i.e. NR keeps increasing across files)
"next" means the remaining commands are skipped; they only run on the files after the first one in the argument list

The command saves column 3 of file2 in hash-array using column 2 (the names of the drugs) as key: myarray[$2] = $3
Then read file1 and output fields $1,$2,$3, appending the corresponding saved column from hash-array myarray[$2]
This solution should work even if the data are not in the same order, no need to sort
The: myarray[$2] = $3  saves $3 as the value and $2 as the key
It matches exactly the second column from both files


If I need to take the entire lines of file2 and merge them, still keyed by the drug names present in field 2
awk 'NR==FNR {myarray[$2] = $0; next} {print $1,$2,$3,myarray[$2]}' file2_info_drug.txt file1_smiles_score.txt
Smiles name score disease name    vendor    cost
CCC drug1 12 type1    drug1  vendor1    high
CCCCC drug2 10 type2    drug2    vendor2  low
CCCCCCC drug3 8 type3    drug3    vendor3    low

If I need to add fields 3 and 4 of file2 to file1
awk 'NR==FNR {myarray[$2] = $3; add_info[$2] = $4; next} {print $1,$2,$3,myarray[$2],add_info[$2]}' file2_info_drug.txt file1_smiles_score.txt
Smiles name score vendor cost
CCC drug1 12 vendor1 high
CCCCC drug2 10 vendor2 low
CCCCCCC drug3 8 vendor3 low

--------------

Some Unix and Awk, dealing with files

Combine two files

paste -d, file1 file2
the "," is now the separator, default paste separator is tab

Read lines from both files alternately
paste -d'\n' file1 file2

 

Create a file with Nano on mac for instance:
nano filetest.txt

then
seq 3 | xargs -I{} cp filetest.txt filetest{}.txt
ls -lrt

filetest.txt
filetest3.txt
filetest2.txt
filetest1.txt

awk '{print $1}' filetest{1..3}.txt
1
4
1
4
1
4

awk '{print $1}' filetest{1..3}.txt | column -t
1
4
1
4
1
4

awk '{print $1, $2}' filetest{1..3}.txt | column -t
1  2
4  4
1  2
4  4
1  2
4  4


With 2 files, you can find this example on the www:
awk 'FNR==NR{a[FNR]=$1; next}{print a[FNR],$1}' myfile1.txt myfile2.txt > output.txt

The AWK variable FNR is the line number within the current input file and NR is the overall line number of the input. The two are equal only while the first input file is being read.
The first fields of the first file are saved in the a array (a[FNR]=$1), whose keys are line numbers and whose values are the 1st fields. Then, when the second file is read, the value corresponding to its line number (a[FNR]) and the current line's 1st field are printed.



Assuming each of your files has the same number of rows:

awk -f script.awk file1.txt file2.txt file3.txt file4.txt

Contents of script.awk (note: length() applied to an array is a gawk extension):

FILENAME == ARGV[1] { one[FNR]=$1 }
FILENAME == ARGV[2] { two[FNR]=$3 }
FILENAME == ARGV[3] { three[FNR]=$7 }
FILENAME == ARGV[4] { four[FNR]=$1 }

END {
    for (i=1; i<=length(one); i++) {
        print one[i], two[i], three[i], four[i]
    }
}

NB:
By default, awk separates columns on whitespace. This includes tab characters and spaces, and any amount of these.
This makes awk ideal for files with inconsistent spacing.
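
A small demonstration: three spaces and a tab are treated the same way:

printf 'a   b\tc\n' | awk '{print $2}'
b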


warning: process substitution is written <( with no space between < and (
paste <(awk '{print $1}' filetest3.txt) <(awk '{print $3}' filetest.txt)
paste <(awk '{print $1}' filetest3.txt) <(awk '{print $0}' filetest.txt) > file4


-----

Print lines after string-match with AWK
input myfile is:
@<TRIPOS>MOLECULE
1
ghh
tripos
TRIPOS

@<TRIPOS>MOLECULE
2
kkl

kkl
TRIPOS

@<TRIPOS>MOLECULE
3
llll
toto

---
The getline command reads the next line from the file


awk '/@<TRIPOS>MOLECULE/ {getline; print $0}' myfile
or
awk '/@<TRIPOS>MOLECULE/ {getline; print;}' myfile
or
awk '/@<TRIPOS>MOLECULE/ {getline; print}' myfile
print this:
1
2
3

awk '/@<TRIPOS>MOLECULE/ {print;getline;print;}' myfile
print this:
@<TRIPOS>MOLECULE
1
@<TRIPOS>MOLECULE
2
@<TRIPOS>MOLECULE
3

Print 2 lines after a string
awk '/@<TRIPOS>MOLECULE/{x=NR+2;next}(NR<=x){print}'  myfile

Print 1 line after the first string is found and stop
awk '/@<TRIPOS>MOLECULE/ {getline; print;exit}' myfile

-----
If in a Mol2 file we have:

@<TRIPOS>MOLECULE
compoundID_000

I need to fetch all the compound IDs below @<TRIPOS>MOLECULE but I do not want the numbers after the underscore
I can try with:
awk 'BEGIN {FS="_" } /@<TRIPOS>MOLECULE/ {getline; print $1}' myfile


fetch the lines after @<TRIPOS>MOLECULE or after Energy
awk '/@<TRIPOS>MOLECULE|Computed Energy/ {getline; print $1}' small_database_no_underscore.mol2


print two lines into one
input file is:
9999899989
9
ZINC00006468999
3
ZINC00006468
7
9999899985
5

the long numbers are compound IDs, and the number below each is the energy score

This will print the energy first and then the compound ID
awk 'BEGIN{i=1}{line[i++]=$0}END{j=1; while (j<i) {print line[j+1] , line[j]; j+=2}}'  myfile

I can combine the two:

awk '/@<TRIPOS>MOLECULE|Computed Energy/ {getline; print $1}' small_database_no_underscore.mol2 | awk 'BEGIN{i=1}{line[i++]=$0}END{j=1; while (j<i) {print line[j+1] , line[j]; j+=2}}' > outputfile1

to print this:
9 9999899989
3 ZINC00006468999
7 ZINC00006468
5 9999899985

Another way to print two lines into one:

cat myfile | paste -d, - - > ouput
cat is not needed in fact, but the command reads more clearly; equivalently: < myfile paste -d, - - > output

paste can read several files; instead of a file name, we can use - (dash) for stdin. paste takes the first line of stdin for the first -. Then, it wants to read the first line for the second -.
However, since the first line of stdin has already been read and processed, what it gets is the second line. This glues the second line to the first
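
A self-contained demonstration with seq:

seq 6 | paste -d, - -
1,2
3,4
5,6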


If a good energy is 9 and a bad one is 3, I can sort on the energy:
sort -n -r -k1,1  outputfile1
the -n is numeric sort, -r is reverse order, -k1,1 is the first field
I get:
9 9999899989
7 ZINC00006468
5 9999899985
3 ZINC00006468999

Then I can fetch only the compound IDs after sorting

sort -n -r -k1,1  outputfile1 | awk '{print $2}' > ID_numbers_sorted.txt

Combine all to try:

awk '/@<TRIPOS>MOLECULE|Computed Energy/ {getline; print $1}' small_database_no_underscore.mol2 | awk 'BEGIN{i=1}{line[i++]=$0}END{j=1; while (j<i) {print line[j+1] , line[j]; j+=2}}' | sort -n -r -k1,1  | awk '{print $2}'  > ID_numbers_sorted.txt

or, with underscores in the database for the docked compound IDs (these can be different poses of the docked compounds: _01, _02...) and with cat:
cat small_database_with_underscores_on_IDs.mol2 | awk 'BEGIN {FS="_" } /@<TRIPOS>MOLECULE|Computed Energy/ {getline; print $1}'  | awk 'BEGIN{i=1}{line[i++]=$0}END{j=1; while (j<i) {print line[j+1] , line[j]; j+=2}}' | sort -n -r -k1,1  | awk '{print $2}' > ID_numbers_sorted.txt

If I have in the mol2 file:

@<TRIPOS>MOLECULE
93894861_008
 52 54 0 0 0
SMALL
NO_CHARGES


@<TRIPOS>ATOM
1    Cl       31.1781   13.6798   13.8999  Cl
    51   22   49 1
    52   22   50 1
    53   29   51 1
    54   29   52 1
@<TRIPOS>PROPERTY_DATA
  Total_Score   | 4.9061

It means the compound IDs carry an underscore suffix, which I want to remove, and the energy score is not written below a line containing a specific string, but on the same line as a string

thus I have to change the above to:

 

cat small_database_with_underscores_on_IDs.mol2 | awk 'BEGIN {FS="_" } /@<TRIPOS>MOLECULE/ {getline; print $1} /Total/ {print $NF}' | awk 'BEGIN{i=1}{line[i++]=$0}END{j=1; while (j<i) {print line[j+1] , line[j]; j+=2}}' | sort -n -r -k3,3 | awk '{print $4}' > ID_numbers_sorted.txt

One intermediate view is like this:

Score    | 2.2927 98268304
Score    | 3.9573 99806113
Score    | 3.7241 99884208
Score    | 4.7876 99970627

Thus I need to sort on the 3rd column only (-k3,3), and the compound IDs are then in field 4 (thus the print $4)

If underscores are a problem in the Mol2 file, they can be changed, for instance with 

cat mymol2file.mol2 | awk '{gsub(/\<NO_CHARGES\>/,"NO-CHARGES")}1' | awk '{gsub(/\<Total_Score\>/,"Total-Score")}1' > myoutput.mol2

gsub is global substitution, and the \< \> mark the edges (word boundaries) of the string to substitute, to avoid changing other words (note: \< and \> are gawk extensions, not available in every awk)

another form: awk '{gsub(/pattern/,"replacement",$N)}1' inputfile, acts on the specific field $N only (the trailing 1 prints the line)

like: awk '{gsub(/1/,"0",$1)}1' file

exact match also with
awk '/^toto$/' myfile
the ^ and $ for start and end of the line

 

To check the number of lines in the compound ID file, possible to use:

awk 'NF{c++}END{print "total: "c}' myfile

instead of wc -l (which may miss the last line depending on how it ends)


see below for this one:
./extract_mol2_awk_script ID_numbers_sorted.txt small_database_no_underscore.mol2 MYCOMPOUND_ENERGY_SORTED.mol2

-----------

Extract molecules from Mol2 file if it matches a list of compound IDs present in another file

For example in one file you have the compound ID like this:
5247489
5523453
6072484
6190013

Also, it is important to remember that with awk one can define the Record Separator and the Field Separator; this makes MOL2 files much easier to handle. With RS="@<TRIPOS>MOLECULE" and FS="\n", $0 is the complete description of a single molecule, excluding the starting @<TRIPOS>MOLECULE tag itself, and each field is one line of the record ($2 is the title line, which may or may not hold the molecule name; $1 is empty because of the newline right after the tag). Note that a multi-character RS is a gawk feature; plain POSIX awk only uses the first character of RS.
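
For instance, a minimal sketch that prints the title line of every molecule of a bank (mybank.mol2 is an assumed name; run with gawk if plain awk ignores the multi-character RS):

awk 'BEGIN {RS="@<TRIPOS>MOLECULE" ; FS="\n"} NR>1 {print $2}' mybank.mol2

($2 and not $1, because the newline that followed the tag leaves an empty first field; NR>1 skips whatever precedes the first tag)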

It is possible to pass shell variables to awk in different ways... for example, in a script you can have:
var=5
var2=7

awk -v gg=$var -v iii=$var2 'BEGIN {print gg"\t" iii}'


then, inside a file called for instance extract_mol2_awk_script, you can have this small script:
---
if [ $# != 3 ]; then echo "Usage: ./shell_file  list_of_compound_IDs_one_per_line_no_space  large_Mol2_bank.mol2  output_my_extracted_molecules.mol2" ; exit
fi

mylist=$1
largeMol2bank=$2
extractedMol=$3

input_list_id=`cat $mylist`

for i in $input_list_id
do
echo $i

echo "molecule mol2" $i "extracted"
awk -v myvar=$i 'BEGIN {RS="@<TRIPOS>MOLECULE" ; FS="\n"} $2==myvar {print "@<TRIPOS>MOLECULE" ; print $0 }' $largeMol2bank | awk 'NF' >> $extractedMol
done

echo "The extracted molecules are in the file:" $extractedMol

# Comments
# awk -v myvar=$i 'BEGIN {RS="@<TRIPOS>MOLECULE" ; FS="\n"} $2==myvar {print "@<TRIPOS>MOLECULE" ; print $0 }' $largeMol2bank | awk 'NF' >> $extractedMol
# With awk, NF is only non-zero on non-blank lines. When this condition matches, awk's default action is to print the whole line; this removes the empty lines
# It seems that without the blank lines the file is OK, but double-check
#
# if the number ID in the large mol2 file has underscores, like this is a special pose number
# and this is not present in your list of compound IDs (that are usually the ID to order compounds)
# one can use: cut -f1 -d '_'  small_database.mol2 > small_database_no_underscore.mol2
# the -f1 takes what is before the delimiter, defined here as "_", thus via -d_ or -d '_'
# Outputs the first field (-f1) of the file where fields are delimited by underscore (-d '_')
# check there are no other places with underscore that could be damaged by this command
#
# to use this script, written into a file (extract_mol2_awk_script) made executable by chmod +x
# do: ./extract_mol2_awk_script   my_IDs   small_database_no_underscore.mol2  MYOUTPUT.mol2
# the IDs can be 990900 (often compound IDs are numbers, but they can be something else)
# this works for instance if the compound ID is below @<TRIPOS>MOLECULE
# end

--------------

Swap ID names in an SDF file before the coordinates

This can really be a pain as some packages are very, very sensitive to non-standard SDF format
For instance, if the file starts with a blank line, the program may crash (to drop such a first line: awk 'NR>1' filename); if there is a blank or empty line after $$$$, it may crash, etc.

If each compound starts with a string, ISIS for instance, but with blank lines after the $$$$ etc., one can try some cleaning before using the name-swap script:

cat my3D.sdf | awk '/ISIS/{print "ZZZZZ"}1' | awk -v n=-2 'NR==n+1 && !NF{next} /\$\$\$\$/ {n=NR}1' | awk 'NR>1' > myOUTPUT.sdf

This will put ZZZZZ on a line before each ISIS line, then delete the empty line after $$$$ and remove the first (blank) line of the file

then Swap script:

We can have this short script in a file, do the chmod +x, and run it with ./myscript

 

# We have here all compounds that start with something like ISIS blabla
# After the coordinates, assuming we have a flag: > <Name> and one line after the name or ID of the compound
#
#
echo -n "select the SDF file where you want to move or swap the cmp names before the coordinates:"
read -e file1
grep "\<Name\>" -A1 $file1 -c
# to count the number of IDs in the SDF file
grep "\<Name\>" -A1 $file1 | grep -v "<Name>" | grep -v "\-\-" | sed -e 's/ /-/g' > only_my_cmpd_name_from_the_SDFbank.txt
# this print the line with >  <Name> and the compound ID below and two -- between some compound names
# the sed action is to replace white space in a name by - thus to something like compound-name-full
sleep 1
# sleep is not needed
echo step_1_grep_all_the_drug_names_done
grep "\<ZZZZZ\>" $file1 -c
sleep 1
echo step_2_grep_all_the_ISIS_name_on_top_of_each_cmpd_this_number_should_be_the_same_as_the_first_number_printed_above
awk '{
    if ($1 == "ZZZZZ") {
        getline < "only_my_cmpd_name_from_the_SDFbank.txt"
        print
        } else
            print
}' < $file1 > "${file1%.*}"_swapped_name.sdf
sleep 1
rm only_my_cmpd_name_from_the_SDFbank.txt
echo "this is done"
# Some notes
# grep with the backslashes (\< \>) is to match the exact string Name
# Here we need to move the name of the drug compound in a SDF file located after the field Name
# on top of the compound to replace the word ISIS. When the 3D
# structure is generated with some packages, this is needed as input for docking tools, otherwise the name or ID could be lost
# First read the sdf file in a variable file1
# Then find the names and put them into a file
# then with awk, when I find ZZZZZ (the marker put before ISIS) in $file1, i.e. the SDF file, I getline from the file with only the drug names
# and I replace the marker line by the name of the drug
# Then we write everything to the root name of the file: for instance, with myfile.sdf as input, the script takes what is before the dot and adds the extension _swapped_name.sdf
# the read -e allows auto complete with tab when reading the file name
# in awk  == should be exact matching
#
# in the SDF file, after the names have been moved, one may want to remove the compound ID after >  <Name> to avoid having it twice; it can be a problem
# for some project
# one can try, seems to run on mac as well:
# sed -e '/\<Name\>/ { N; d; }' mysdffile.sdf
# the {} gets executed if pattern found, the N is for next line and d is to delete the current line with Name and the line after with the ID
# the "\<mystring\>" as in 's/\<mystring\>/toto/g' is for exact match and g for global
# with awk something like this:
# awk '{sub(/\<mystring\>/, "toto")}1' inputfile > output
#

--------------

Extract the top X molecules of a large mol2

This script has to be in a file, then chmod +x, and run, depending on the system, with: ./myscript

#

#

myinputbank=$1

howmany_cmpd_doyou_want=$2

fixcountloop=$(($howmany_cmpd_doyou_want + 1))


echo "the top " $howmany_cmpd_doyou_want " molecules will be written in Top_extracted_molecules.mol2"

awk 'BEGIN { RS="@<TRIPOS>MOLECULE" ; FS="\n" ; count=0 }
     {
       count++
       # record 1 is the (usually empty) text before the first tag,
       # records 2 to howmany are the first molecules of the bank
       if (count > 1 && count <= howmany)
         print "@<TRIPOS>MOLECULE" $0 > "Top_extracted_molecules.mol2"
     }' howmany=$fixcountloop $myinputbank

exit

 
# usage ./this_script largeINPUTbank.mol2  number_of_mol_i_want_to_extract

#

#

 

--------------

Extract molecules from SDF files

SD files exported from some packages do not follow the SDF format. You may need to add something, like some flags... for instance you may need to add a line with 'M  END' before every terminating '$$$$'. This can be done as below. The output field separator (OFS) is set to the empty string to prevent additional separators (test this by setting OFS="\n"; note the default OFS is a single space).

awk 'BEGIN {RS="\\$\\$\\$\\$\n"; OFS=""} {print $0,"M  END\n$$$$"}' badSDFformat.sdf > Clean.sdf

SDF file should be
Compound_IDs
coordinates
$$$$
Compound_IDs

(no blank line between $$$$ and the next compound ID)


To extract one molecule from an SD file, we assume the ID number starts each compound in field 1, and the end of each compound is always $$$$
One can try:

awk 'BEGIN {RS="\\$\\$\\$\\$\n"; FS="\n"} $1=="patulin" {print $0; print "$$$$"}' largefile.sdf > one_molecule.sdf

or, if that adds the $$$$ twice:
awk 'BEGIN {RS="\\$\\$\\$\\$\n"; FS="\n"} $1=="patulin" {print $0}' largefile.sdf > one_molecule.sdf
(note the quotes around "patulin": unquoted, awk would treat it as an empty, uninitialized variable)

Extract a list of molecules, if you have the IDs of the compounds in a file and the large SDF file in another file
In one file you have the ID like this:
9154647
9155875...


With this script in a file, and chmod +x:
#
if [ $# != 3 ]; then echo "Usage: ./shell_file  list_of_IDs_one_per_line_no_space  large_SDF_bank.sdf  file_name_of_extracted_molecules.sdf" ; exit
fi

mylist=$1
largeSDFbank=$2
extractedMol=$3

input_list_id=`cat $mylist`


for i in $input_list_id
do
echo $i


echo "molecule" $i "extracted"
awk -v myvar=$i 'BEGIN {RS="\\$\\$\\$\\$\n" ; FS="\n"} $1==myvar {print $0}' $largeSDFbank >> $extractedMol
done


echo "The extracted molecules are in the file:" $extractedMol

# Usage: ./this_shell_file    list_of_IDs_one_per_line_no_space     large_SDF_bank.sdf   file_name_of_extracted_molecules.sdf "
# NOTE/WARNING: MAYBE YOU DO NOT NEED TO REMOVE BLANK LINES WITH THE FIRST SED COMMAND NOR ADD AN EMPTY LINE BEFORE THE DOLLARS
# IF SO, DELETE THE SLEEP AND SED LINES AND END WITH: echo "The extracted molecules are in the file:" $extractedMol
# if $$$$ missing at the end use:
# awk -v myvar=$i 'BEGIN {RS="\\$\\$\\$\\$\n" ; FS="\n"} $1==myvar {print $0 ; print "$$$$"}' $largeSDFbank >> $extractedMol
#
#

 


Get some compounds at (pseudo) random from an SDF file
We get the SDF file from PubChem
Each compound starts with its ID number
and ends with $$$$, with no empty line between the $$$$ and the following ID number
I can try something like this to get the compound IDs, then select X compound IDs at random from the list
and put the output in a new SDF file
On a Mac this requires: brew install coreutils
so as to have gshuf

Copy the script below in a file myscript_random_sdf, do the chmod +x and run it with ./myscript_random_sdf

# the read -e allows tab auto-completion for the file names
# echo with some special info to color the text
#echo -n "select the file with the ID of molecules, 1 ID per line no space: "

echo -n -e "\033[1;31mselect the file with the ID of molecules, 1 ID per line no space: \033[0m"
read -e file1


echo -n -e "\033[1;31mselect the large SDF file, each compound should start by ID and end with 4 dollars : \033[0m"
read -e file2

echo -n -e "\033[1;31menter the name of the output file for your selected molecules ending with .sdf : \033[0m"
read myoutputsdf

echo -n -e "\033[1;31mhow many compounds do you want to extract at pseudorandom from the ID file : \033[0m"
read random_selection

jo=`cat $file1 | gshuf -n $random_selection`
for i in $jo
do
echo "molecule" $i "extracted"

awk -v myvar=$i 'BEGIN {RS="\\$\\$\\$\\$\n" ; FS="\n"} $1==myvar {print $0 ; print "$$$$"}' $file2 >> $myoutputsdf
done

echo "The extracted molecules are in the file:" $myoutputsdf

# GNU shuf or gshuf is installed on mac with: brew install coreutils
# tested like this: seq 100 | gshuf -n 3
# sequence of 100 and random select 3 numbers from 100

# The read -e should allow autocomplete of filenames with tab
# To get the compound IDs from the SDF file I can do something simple
#
# awk '/\$\$\$\$/ {getline; print $0;}' mysdffile
# but this can print the last $$$$ if there is no empty line after it
# I can add some new empty lines at the end of the file with:
#
# I get the first line of mysdf file with : awk 'NR==1' myfile
# add two new lines at the SDF file only if there are none
# awk '/^$/{f=1}END{ if (!f) {print "\n\n"}}1' mysdf > output
# or print some empty lines in all case
# awk 'END {print "\n\n"}1' mysdf > output
# remove empty lines if needed: awk 'NF' myfile
# If i combine all I can have something like this to get all the compound IDs of the pubchem SDF file:
# cat myPubChemfile.sdf | awk 'END {print "\n\n"}1' | awk 'NR==1 ; /\$\$\$\$/ {getline; print $0;}' | awk 'NF' > myIDs.txt
#

# warning: this may add $$$$ twice if it extracts the last compound of the SDF file; this has to be checked and, if so, deleted
# this could be changed in the script but no time; for some software packages it does not matter, for others it may crash the tool


Find string in common between two files without sorting
File1:
10
20
30
3000

File2:
10
3000

grep -F -f file1 file2
Print:
10
3000

the -F, --fixed-strings option treats the patterns as plain strings
the -f FILE, --file=FILE option reads the patterns from FILE
Can be slow on large files

Warning grep can do some strange things (regular expression versus plain string, blank lines..)
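
To force grep to match whole lines as fixed strings (avoiding the regular-expression surprises), -x can be added:

grep -F -x -f file1 file2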

or the equivalent with awk (keeps the order of file2)
awk 'NR==FNR{a[$1]++;next} a[$1]' file1 file2


awk 'NR==FNR{arr[$0];next} $0 in arr' file1 file2
This will read all lines from file1 into the array arr[], and then check for each line in file2 if it already exists within the array (i.e. file1). The lines that are found will be printed in the order in which they appear in file2. The comparison in arr uses the entire line from file2 as index to the array, so it will only report exact matches on entire lines.
or
find what is not in common
awk 'NR==FNR {exclude[$0];next} !($0 in exclude)' file1 file2

print:
20
30

Find IDs in common or different in two files, some kind of Venn diagram could be done
File1
2000_2
3000_1
8800_00
304_09

File2
8800_00
304_09

unique data in file1:
awk 'BEGIN {FS = "_"} NR==FNR{a[$1];next}!($1 in a)' file2.txt file1.txt

unique data in file2:
awk 'BEGIN {FS = "_"} NR==FNR{a[$1];next}!($1 in a)' file1.txt file2.txt

common ID numbers in two files:
awk 'BEGIN {FS = "_"} NR==FNR{a[$1];next} ($1 in a)' file1.txt file2.txt

NR==FNR    - execute the next block for the 1st file only
a[$1]      - create an associative array indexed by the key $1 (or $0 for the whole line)
next       - move to the next row
($1 in a)  - for each line of the second file, test membership in the array a:
             print the common lines of the 1st and 2nd files: "($1 in a)' file1 file2"
             or the lines unique to the 1st file only: "!($1 in a)' file2 file1"
             or the lines unique to the 2nd file only: "!($1 in a)' file1 file2"
             
To check, I can merge the 3 output files (uniquefile1, uniquefile2 and commonFile1_and_2):
awk 'BEGIN {FS = "_"} {print $1}' all_output_merged.txt | sort -n | uniq -d
 
In order to print the duplicated lines, one way here, as we have only one field with numbers:
(it does not say how many times each string is found)
sort file1.txt | uniq -d             
             
Or with AWK (it does not say how many times the string is found)
awk '{i=1;while(i <= NF){a[$(i++)]++}}END{for(i in a){if(a[i]>1){print i,a[i]}}}' file1.txt
 

 

--------------

Compare old and new lists of SMILES and drugs - diff, and here sdiff, to see side-by-side differences only for fields 1 and 5; field 1 could be the SMILES strings and field 5 the names of the molecules
sdiff -a -i -s -WB -w 300 <(awk '{print $1, "\t", $5}'  old_database_drugs.tsv) <(awk '{print $1, "\t", $5}'  last_database_drugs.tsv) > found_differences_old_new_files

-w changes the default width printed on the terminal; 300 is more than the default

-i ignore case, -W ignore all space, -B ignore blank lines, -s suppress common lines, -a treat all files as text

 

Some double-checking of the sdiff command just above

If I have one file with an old version of compounds in SMILES and a new one with some additional compounds here and there,
I could cat the two files together and remove the duplicate lines
Something like this could be tried:

This removes duplicate lines, ignoring case
awk '!seen[toupper($0)]++' file

If upper/lower case is not a problem, remove duplicate lines
awk '!seen[$0]++' file

This one will instead show the extra duplicate lines
awk 'seen[$0]++' file

if there is extra white space between strings (tabs not changed), remove duplicate lines
(will not work if there is extra white space at the end of the last string, for instance)
awk '{gsub(/[ ]+/," ")}!seen[$0]++'

This will squeeze extra white space and also tabs
awk '$1=$1' file
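
(Small caveat: with '$1=$1' alone, lines whose first field is empty or 0 are not printed, because the value of the assignment is used as the pattern; the safer idiom forces printing with a trailing 1:)

awk '{$1=$1}1' file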

Combining the upper/lower-case and extra-white-space handling to remove duplicate lines
cat fichier.txt | awk '$1=$1' | awk '!seen[$0]++'

For instance in file we have:
Aspirin     23   
aspirin    23
vitamin    CCCC
vitamin    CCCC
vitamin    CCCC
VITamin          CCCC         

Thus different spacings, and some strings are followed by tabs
This:
cat fichier.txt | awk '$1=$1' | awk '!seen[toupper($0)]++'

removes the duplicates and prints:
Aspirin 23
vitamin CCCC

 

--------------


AWK and SED
An awk program is a sequence of statements of the form:
        pattern   { action }
        pattern   { action }

Each  line  of  input  is matched against each of the patterns.  For each pattern that matches, the  associated  action  is
executed.  When all the patterns have been tested, the next line is fetched and the matching starts over.


Examples
awk 'NR == 1 {print  $2}' mydatatfile
This says select line one and do action {..}, here print field 2

awk 'length > 10' myfile
This prints every line that is longer than 10 characters

awk '{ $1 = log($1); print }' myfile
Replaces the first field of each line by its logarithm

awk '$2 ~ /A|B|C/' myfile
Prints all input lines with an A, B, or C in  the  second  field

awk '$2 ~/0.1/' myfile > myoutput
Prints all lines with 0.1 in the second field and copy them in file myoutput

awk '{print "end"; print $0}' myfile
This prints the word end before each line

awk '{print "\$" $0}' myfile
prints a $ in front of each line

sed 's/\$/\"/'
substitutes the first $ of each line by a " (add g for all occurrences)

sed 's/\> \<ID\>/\> \<ID_STRUCTURE\>/g' file_input.sdf > fileoutput.sdf
substitutes each > <ID> by > <ID_STRUCTURE>

This prints the last field of each line
awk '{ print $NF }' myfile

Print the last field of the last line
awk '{ field = $NF }; END{ print field }' myfile

Print every line with more than 4 fields
awk 'NF > 4' myfile

Print every line where the value of the last field is > 4
awk '$NF > 4' myfile

Some other examples
awk 'BEGIN {i=1; while (i<=10){ print i*i; i++}}'

awk 'BEGIN {col = 13; {print col}}'

awk 'BEGIN {lines=0} {lines++} END {print lines}' myfile
(somewhat like wc -l)

Change the field separator:
if in myfile I have:
uuuu:kkkk:lllll5676
uuuu:kkkk:lllll8999
uuuu:kkkk:lllll00999

awk 'BEGIN {FS=":"} {print $2}' myfile
I force the separator to be : and I keep field 2

awk '$3 == 0 {print $1}' myfile1 myfile2
If field 3 = 0, print field 1 of my two files

awk '$2 > 0.5 {col = col +1} END {print col}'
prints the number of times there are numbers above 0.5 in field 2

Print lines with the word "brian" in them:
awk '/brian/   { print  $0 }' myfile

Print each input line preceded with a line number
print the heading which includes the name of the file
awk 'BEGIN  {  print "File:", FILENAME } { print NR, ":\t", $0 }' myfile

awk '{name = name $1} END {print name}'
Concatenates the first field of every line and prints all on one line

Insert 5 blank spaces at beginning of each line
awk '{sub(/^/, "     ");print}'

Substitute "foo" with "bar" EXCEPT for lines which contain "baz"
awk '!/baz/{gsub(/foo/, "bar")};{print}'

Change "scarlet" or "ruby" or "puce" to "red"
awk '{gsub(/scarlet|ruby|puce/, "red"); print}'

Remove duplicate, consecutive lines (emulates "uniq")
awk 'a !~ $0; {a=$0}'

Remove duplicate, nonconsecutive lines
awk '! a[$0]++'                     # most concise script
awk '!($0 in a) {a[$0];print}'      # most efficient script

Print the first line
awk 'NR <2' test

Print the last line of a file (emulates "tail -1")
awk 'END{print}'

Print only lines which match regular expression (emulates "grep")
awk '/regex/'

Print only lines which do NOT match regex (emulates "grep -v")
awk '!/regex/'

Print section of file from regular expression to end of file
awk '/regex/,0'
awk '/regex/,EOF'

Print section of file based on line numbers (lines 8-12, inclusive)
awk 'NR==8,NR==12'

Print line number 52
awk 'NR==52'
awk 'NR==52 {print;exit}'          # more efficient on large files

Print section of file between two regular expressions (inclusive)
awk '/Iowa/,/Montana/'             # case sensitive

Delete ALL blank lines from a file (same as "grep '.' ")
awk NF myfile > myoutput
awk '/./'

awk '{n=5 ; print $n}' myinput
prints the fifth field in the input record

To list a file, but skip over all the blank lines at the start of the file, use the command:
awk '/[^ ]/ { copy=1 }; copy { print }' filename.ext

To list all the lines of a file (TMP.LOG), except those containing the string "Frame overlap", you can use the command:
awk "!/Frame overlap/" TMP.LOG

Adding a blank line after lines in a list
To add a blank line after all lines containing "util.h":
awk "{ print $0; if ( $0 ~ /util.h/) print \"\" }" TMP.TMP

expression: none do this for all lines
action: print $0; if ( $0 ~ /util.h/ ) print ""   -- print the line, then, if the line contains "util.h", print a blank line

to add a blank line at the end of a file:
awk '{print $0} END {print ""}' myfile.txt > myoutput.txt

Dealing with $ in SDF files
sed '
/\$\$\$\$/ {
N
/\n.*ISIS/ {
s/\$\$\$\$.*\n.*ISIS/$$$$ ISIS/
}
}'
(the dollars are escaped in the patterns since $ is a regular-expression metacharacter)

awk 'BEGIN {for (x=1; x<=50; ++x) {printf("%3d\n",x) >> "tfile"}}'
dumps the numbers from 1 to 50 into "tfile".

Output can also be "piped" into another utility with the "|" ("pipe") operator. One can pipe output to the "tr" ("translate") utility to convert it to upper-case:
awk 'BEGIN { print "this is a test"}' | tr "[a-z]" "[A-Z]"
yields:
   THIS IS A TEST

To run AWK
-You can type: awk '{.....}' myinputfile > myoutputfile
or >> myoutputfile to append to an existing file

-You can put the script in a file and run: awk -f myscript myfile...and many other ways around
things can depend on the shell you are using (sh, csh, tcsh, bash...), so watch out for the behavior

-F fs
 Sets the FS variable to fs (see section Specifying how Fields are Separated).
 -f source-file
 Indicates that the awk program is to be found in source-file instead of in the first non-option argument.
 -v var=val
 Sets the variable var to the value val before execution of the program begins. Such variable values are available inside the BEGIN rule (see below for a fuller explanation).  The `-v' option can only set one variable, but you can use it more than once, setting another variable each time, like this: `-v foo=1 -v bar=2'.

ESSENTIAL SYNTAX
Arithmetic
Operator   Meaning
+          addition
-          subtraction
*          multiplication
/          division
%          modulo
++         increment
--         decrement
^          exponentiation
+=         plus equals
-=         minus equals
*=         multiply equals
/=         divide equals
%=         modulus equals
^=         exponential equals



awk 'BEGIN {OFS="\t"} {print $1,$2,$3,(($1+$2+$3)/3)}' IN > OUT
Will print out column 1, column2, column 3, and the mean of the 3 columns

awk '{print ($1/66),($2/6430),($3/627)}' IN>OUT
Dividing each column by different numbers


awk '{printf "%.3f\t%.3f\t%.3f\n", ($1/66),($2/6430),($3/627)}' IN>OUT
printf to format with 3 decimal places, separated with tabs

Conditional expressions
Operator   Meaning
==         is equal to
!=         is not equal to
>          is greater than
>=         is greater than or equal to
<          is less than
<=         is less than or equal to


Regular Expression Operators
Operator   Meaning
~          matches
!~         does not match

ex: word !~ /START/

AND and OR and not matching
&& and ||  and !

Built-in VARIABLES
Records and Fields
Awk input is divided into records terminated by a record separator. The default record separator is a newline, so by default awk processes its input a line at a time. The number of the current record is available in the variable NR.
Each input record is considered to be divided into fields. Fields are normally separated by white space (blanks or tabs), but the input field separator may be changed. Fields are referred to as $1, $2, and so forth, where $1 is the first field, and $0 is the whole input record itself. Fields may also be assigned to. The number of fields in the current record is available in the variable NF.
The variables FS and RS refer to the input field and record separators; they may be changed at any time to any single character. The optional command-line argument -Fc may also be used to set FS to the character c.
The variable FILENAME contains the name of the current input file.

print $0 prints the full line; if the line has 8 fields, it is equivalent to:
print $1, $2, $3, $4, $5, $6, $7, $8

FS - The Input Field Separator
The input field separator, a blank by default

FNR
The input record number in the current input file

OFS - The Output Field Separator
The output field separator, a blank by default

ORS - The Output line record Separator
The output record separator, by default a newline

NF - The Number of Fields
Awk counts the number of fields in the input line and puts it into a variable called NF.
awk '{print NF, $NF}' myinput
print the number of field and the last field of each line

NR - The Number of Records - the current input line number
Awk counts the number of lines it reads
awk '{print NR, $0}'
This prints the line number and the complete line

RS - The input line Record Separator (default = newline)
The input record separator, by default a newline

Change the RS

'BEGIN {
# change the record separator from newline to blank line (paragraph mode)
    RS=""
# change the field separator from whitespace to newline
    FS="\n"
}
{
# print the second and third lines of the file
    print $2, $3}' myfile

if myfile is:
50.211 14.979 24.196
50.142 15.162 25.415

awk 'BEGIN {RS=""; FS="\n"} {print $2}' myfile
gives:
50.142 15.162 25.415

Arrays
awk provides one-dimensional arrays. Arrays need not be declared; they are created in the same manner as awk user-defined variables.
Elements can be indexed by numeric or string values.
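
A minimal example, counting how many times each value of the first field occurs (myfile is an arbitrary name):

awk '{n[$1]++} END {for (k in n) print k, n[k]}' myfile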


Length
this counts the number of characters in a string
awk '{print length($0)}' myfile


Print and Printf
The print statement does output with simple, standardized formatting. You specify only the strings or numbers to be printed, in a list separated by commas. They are output, separated by single spaces, followed by a newline. The statement looks like this:
print item1 , item2 , ...

The simple statement print with no items is equivalent to print $0
it prints the entire current record. To print a blank line, use 'print ""', where "" is the null, or empty, string.

Using printf Statements for Fancier Printing
A format specifier starts with the character % and ends with a format-control letter; it tells the printf statement how to output one item. The format-control letter specifies what kind of value to print. The rest of the format specifier is made up of optional modifiers which are parameters such as the field width to use.

 Here is a list of the format-control letters:
 c    This prints a number as an ASCII character. Thus, 'printf "%c", 65' outputs the letter A. The output for a string value is the first character of the string.
 d     This prints a decimal integer.
 i    This also prints a decimal integer.
 e    This prints a number in scientific (exponential) notation. For example,
printf "%4.3e", 1950

 prints 1.950e+03, with a total of four significant figures of which three follow the decimal point. The 4.3 are modifiers, discussed below.
f    This prints a number in floating point notation.
g      This prints a number in either scientific notation or floating point notation, whichever uses fewer characters.
o    This prints an unsigned octal integer.
s      This prints a string.
x     This prints an unsigned hexadecimal integer.

%    This isn't really a format-control letter, but it does have a meaning when used after a %: the sequence `%%' outputs one '%'. It does not consume an argument.

A format specification can also include modifiers that can control how much of the item's value is printed and how much space it gets. The modifiers come between the '%' and the format-control letter. Here are the possible modifiers, in the order in which they may appear:
 '-'
The minus sign, used before the width modifier, says to left-justify the argument within its specified width. Normally the argument is printed right-justified in the specified width. Thus,
printf "%-4s", "foo"

 prints 'foo '.

'width'
This is a number representing the desired width of a field. Inserting any number between the '%' sign and the format control character forces the field to be expanded to this width. The default way to do this is to pad with spaces on the left. For example,
printf "%4s", "foo"

 prints ' foo'.  The value of width is a minimum width, not a maximum. If the item value requires more than width characters, it can be as wide as necessary. Thus,
printf "%4s", "foobar"

 prints 'foobar'.  Preceding the width with a minus sign causes the output to be padded with spaces on the right, instead of on the left.
 '.prec'
This is a number that specifies the precision to use when printing. This specifies the number of digits you want printed to the right of the decimal point. For a string, it specifies the maximum number of characters from the string that should be printed.

The C library printf's dynamic width and prec capability (for example, "%*.*s") is supported. Instead of supplying explicit width and/or prec values in the format string, you pass them in the argument list. For example:
w = 5
p = 3
s = "abcdefg"
printf "<%*.*s>\n", w, p, s


is exactly equivalent to
s = "abcdefg"
printf "<%5.3s>\n", s

Both programs output '<**abc>' (the '*' here stands for a space, to show clearly that there are two spaces in the output).


PRINT AND PRINTF again
The simplest output statement is the by-now familiar "print" statement. There's not too much to it:

    •      "Print" by itself prints the input line.

    •      "Print" with one argument prints the argument.

    •      "Print" with multiple arguments prints all the arguments, separated by  spaces (or other specified OFS) when the arguments are separated by commas, or concatenated when the arguments are separated by spaces.

 * The "printf()" (formatted print) function is much more flexible, and trickier. It has the syntax:
   printf(<string>,<expression list>)

 The "string" can be a normal string of characters:
   printf("Hi, there!")

 This prints "Hi, there!" to the display, just like "print" would, with one slight difference: the cursor remains at the end of the text, instead of skipping to the next line, as it would with "print". A "newline" code ("\n") has to be added to force "printf()" to skip to the next line:
   printf("Hi, there!\n")

 So far, "printf()" looks like a step backward from "print", and if you use it to do dumb things like this, it is. However, "printf()" is useful when you want precise control over the appearance of the output.

 The trick is that the string can contain format or "conversion" codes to control the results of the expressions in the expression list. For example, the following program:
   BEGIN {x = 35; printf("x = %d decimal, %x hex, %o octal.\n",x,x,x)}

 -- prints:
   x = 35 decimal, 23 hex, 43 octal.

 The format codes in this example include: "%d" (specifying decimal output), "%x" (specifying hexadecimal output), and "%o" (specifying octal output). The "printf()" function substitutes the three variables in the expression list for these format codes on output.

 * The format codes are highly flexible and their use can be a bit confusing. The "d" format code prints a number in decimal format. The output is an integer, even if the number is a real, like 3.14159. Trying to print a string with this format code results in a "0" output. For example:
   x = 35;     printf("x = %d\n",x)       yields:  x = 35
   x = 3.1415; printf("x = %d\n",x)       yields:  x = 3
   x = "TEST"; printf("x = %d\n",x)       yields:  x = 0

 * The "o" format code prints a number in octal format. Other than that, this format code behaves exactly as does the "%d" format specifier. For example:
   awk 'BEGIN {x = 255; printf("x = %o\n",x)}'          yields:  x = 377

 * The "x" format code prints a number in hexadecimal format. Other than that, this format code behaves exactly as does the "%d" format specifier. For example:
   x = 197; printf("x = %x\n",x)          yields:  x = c5

 * The "c" format code prints a character, given its numeric code. For example, the following statement outputs all the printable characters:
   BEGIN {for (ch=32; ch<128; ch++) printf("%c   %c\n",ch,ch+128)}

 * The "s" format code prints a string. For example:
   x = "jive"; printf("string = %s\n",x)  yields:  string = jive

 * The "e" format code prints a number in exponential format, in the default format:
   [-]D.DDDDDDe[+/-]DDD

 For example:
   x = 3.1415; printf("x = %e\n",x)       yields:  x = 3.141500e+000

 * The "f" format code prints a number in floating-point format, in the default format:
   [-]D.DDDDDD

 For example:
   x = 3.1415; printf("x = %f\n",x)       yields:  f = 3.141500

 * The "g" format code prints a number in exponential or floating-point format, whichever is shortest.

 * A numeric string may be inserted between the "%" and the format code to specify greater control over the output format. For example:
   %3d
   %5.2f
   %08s
   %-8.4s

 This works as follows:

    •      The integer part of the number specifies the minimum "width", or number of  spaces, the output will use, though the output may exceed that width if it  is too long to fit.

    •      The fractional part of the number specifies either, for a string, the  maximum number of characters to be printed; or, for floating-point  formats, the number of digits to be printed to the right of the decimal  point.

    •      A leading "-" specifies left-justified output. The default is right-justified output.

    •      A leading "0" specifies that the output be padded with leading zeroes to  fill up the output field. The default is spaces.

 For example, consider the output of a string:
   x = "Baryshnikov"
   printf("[%3s]\n",x)          yields:       [Baryshnikov]
   printf("[%16s]\n",x)         yields:       [     Baryshnikov]
   printf("[%-16s]\n",x)        yields:       [Baryshnikov     ]
   printf("[%.3s]\n",x)         yields:       [Bar]
   printf("[%16.3s]\n",x)       yields:       [             Bar]
   printf("[%-16.3s]\n",x)      yields:       [Bar             ]
   printf("[%016s]\n",x)        yields:       [00000Baryshnikov]
   printf("[%-016s]\n",x)       yields:       [Baryshnikov     ]

 -- or an integer:
   x = 312
   printf("[%2d]\n",x)          yields:       [312]
   printf("[%8d]\n",x)          yields:       [     312]
   printf("[%-8d]\n",x)         yields:       [312     ]
   printf("[%.1d]\n",x)         yields:       [312]
   printf("[%08d]\n",x)         yields:       [00000312]
   printf("[%-08d]\n",x)        yields:       [312     ]

 -- or a floating-point number:
   x = 251.673209
   printf("[%2f]\n",x)          yields:       [251.673209]
   printf("[%16f]\n",x)         yields:       [      251.673209]
   printf("[%-16f]\n",x)        yields:       [251.673209      ]
   printf("[%.3f]\n",x)         yields:       [251.673]
   printf("[%16.3f]\n",x)       yields:       [         251.673]
   printf("[%016.3f]\n",x)      yields:       [000000000251.673]

----------

The keywords BEGIN and END are used to perform specific actions before and after reading the input lines. The BEGIN keyword is normally associated with printing titles and setting default values, whilst the END keyword is normally associated with printing totals.

awk 'BEGIN {string = "Super" "power"; print string}'
this will print: Superpower

 For example, to extract and print the word "get" from "unforgettable":
 BEGIN {print substr("unforgettable",6,3)}

 Please be aware that the first character of the string is numbered "1", not "0". To extract a substring of at most ten characters, starting from position 6 of the first field variable, you use:
 substr($1,6,10)

Escape sequences
Sequence   Description                  
\b          Backspace                      
\f          Formfeed                      
\n          Newline                      
\r          Carriage Return                  
\t          Horizontal tab
\"           Double quote
\a          The "alert" character; usually the ASCII BEL character
\v          Vertical tab
                  

example: awk '{ print $0 "\n"}' myfile
add a new empty line after each line
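
In the same spirit, "\t" can be used to join fields with a tab; a minimal sketch:
awk '{print $1 "\t" $2}' myfile
prints the first two fields of each line separated by a tab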

Regular Expressions
Pattern searching similar to grep and other unix utilities:
        /386/
        $1  ~  /386/
In regular expressions, the following symbols are metacharacters with special meanings.
        \  ^  $  .  [  ]  *  +  ?  (  )  |

        ^       matches the first character of a string
        $       matches the last character of a string
        .       matches a single character of a string
        [ ]     defines a set of characters
        ( )     used for grouping
        |       specifies alternatives

prints lines whose first field contains at least one character other than 2, 3, 4, 6 or 8 (note that this is not the same as excluding those digits):
awk '$1  ~  /[^23468]/  { print $0 }'
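
To actually exclude lines whose first field contains any of these digits, negate the match instead; a sketch:
awk '$1 !~ /[23468]/ { print $0 }'
prints only the lines whose first field contains none of 2, 3, 4, 6 or 8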


How to hide special characters from the shell (this depends on the shell!)
Preceding any single character with a backslash ('\') quotes that character.

Thus:
awk "BEGIN { print \"Don't Panic!\" }"
you get
tcsh: Unmatched '

but if you use bash, it works.
With tcsh you need to write this:
awk 'BEGIN { print "Here is a single quote '\''" }'
the result is:
Here is a single quote '
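
Another workaround (assuming an awk that understands octal escape sequences in strings, as gawk does) is to write the single quote as \047:
awk 'BEGIN { print "Here is a single quote \047" }'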


Regular expressions are the extended kind found in egrep. They are composed of characters as follows:
 c
matches the character c (assuming c is a character with no special meaning in regexps).
 \c
matches the literal character c.
 .
matches any character except newline.
 ^
matches the beginning of a line or a string.
 $
matches the end of a line or a string.
 [abc...]
matches any of the characters abc... (character class).
 [^abc...]
matches any character except abc... and newline (negated character class).
 r1|r2
matches either r1 or r2 (alternation).
 r1r2
matches r1, and then r2 (concatenation).
 r+
matches one or more r's.
 r*
matches zero or more r's.
 r?
matches zero or one r's.
 (r)
matches r (grouping).


* The simplest kind of search pattern that can be specified is a simple string, enclosed in forward-slashes ("/"). For example:
   /The/

 -- searches for any line that contains the string "The". This will not match "the" as Awk is "case-sensitive", but it will match words like "There" or "Them".

 This is the crudest sort of search pattern. Awk defines special characters or "metacharacters" that can be used to make the search more specific. For example, preceding the string with a "^" tells Awk to search for the string at the beginning of the input line. For example:
   /^The/

 -- matches any line that begins with the string "The". Similarly, following the string with a "$" matches any line that ends with "The", for example:
   /The$/

 But what if you actually want to search the text for a character like "^" or "$"? Simple, just precede the character with a backslash ("\"). For example:
   /\$/

 -- matches any line with a "$" in it.

 * Such a pattern-matching string is known as a "regular expression". There are many different characters that can be used to specify regular expressions. For example, it is possible to specify a set of alternative characters using square brackets ("[]"):
   /[Tt]he/

 This example matches the strings "The" and "the". A range of characters can also be specified. For example:
   /[a-z]/

 -- matches any character from "a" to "z", and:
   /[a-zA-Z0-9]/

 -- matches any letter or number.

 A range of characters can also be excluded, by preceding the range with a "^". For example:
   /^[^a-zA-Z0-9]/

 -- matches any line that doesn't start with a letter or digit.

 A "|" allows regular expressions to be logically ORed. For example:
   /(^Germany)|(^Netherlands)/

 -- matches lines that start with the word "Germany" or the word "Netherlands". Notice how parentheses are used to group the two expressions.

 * The "." special character allows "wildcard" matching, meaning it can be used to specify any arbitrary character. For example:
   /wh./

 -- matches "who", "why", and any other string that has the characters "wh" and any following character.

 This use of the "." wildcard should be familiar to UN*X shell users, but awk interprets the "*" wildcard in a subtly different way. In the UN*X shell, the "*" substitutes for a string of arbitrary characters of any length, including zero, while in awk the "*" simply matches zero or more repetitions of the previous character or expression. For example, "a*" would match "a", "aa", "aaa", and so on. That means that ".*" will match any string of characters.

 There are other characters that allow matches against repeated characters or expressions. A "?" matches zero or one occurrences of the previous regular expression, while a "+" matches one or more occurrences of the previous regular expression. For example:
   /^[+-]?[0-9]+$/

 -- matches any line that consists only of a (possibly signed) integer number. This is a somewhat confusing example and it is helpful to break it down by parts:
   /^                  Find string at beginning of line.
   /^[-+]?             Specify possible "-" or "+" sign for number.
   /^[-+]?[0-9]+       Specify one or more digits "0" through "9".
   /^[-+]?[0-9]+$/     Specify that the line ends with the number.
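
 Used as a complete one-liner, this pattern keeps only the lines that consist of an integer; a sketch:
   awk '/^[+-]?[0-9]+$/' myfile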


The search can be constrained to a single field within the input line. For example:
   $1 ~ /^France$/

 -- searches for lines whose first field ("$1" -- more on "field variables" later) is the word "France", while:
   $1 !~ /^Norway$/

 -- searches for lines whose first field is not the word "Norway".

 It is possible to search for an entire series or "block" of consecutive lines in the text, using one search pattern to match the first line in the block and another search pattern to match the last line in the block. For example:
   /^Ireland/,/^Summary/

 -- matches a block of text whose first line begins with "Ireland" and whose last line begins with "Summary".

 Patterns can also be general expressions. For example:
   NF == 0

 -- matches all blank lines, or those whose number of fields is zero.
   $1 == "France"

 -- is a string comparison that matches any line whose first field is the string "France". The astute reader may notice that this example seems to do the same thing as the previous example:
   $1 ~ /^France$/

 In fact, both examples do the same thing, but in the example immediately above the "^" and "$" metacharacters had to be used in the regular expression to specify a match with the entire first field; without them, it would match strings such as "FranceFour", "NewFrance", and so on. The string comparison matches only the exact string "France".

 * It is also possible to combine several search patterns with the "&&" (AND) and "||" (OR) operators. For example:
   ((NR >= 30) && ($1 == "France")) || ($1 == "Norway")

 -- matches any line from the 30th onward whose first field is "France", or any line whose first field is "Norway".

 * One class of pattern-matching that wasn't listed above is performing a numeric comparison on a field variable. It can be done, of course; for example:
   $1 == 100

 -- matches any line whose first field has a numeric value equal to 100. This is a simple thing to do and it will work fine. However, suppose you want to perform:
   $1 < 100

 This will generally work fine, but there's a nasty catch to it, which requires some explanation. The catch is that if the first field of the input can be either a number or a text string, this sort of numeric comparison can give crazy results, matching on some text strings that aren't equivalent to a numeric value.

 This is because awk is a "weakly-typed" language. Its variables can store a number or a string, with awk performing operations on each appropriately. In the case of the numeric comparison above, if $1 contains a numeric value, awk will perform a numeric comparison on it, as expected; but if $1 contains a text string, awk will perform a text comparison between the text string in $1 and the three-letter text string "100". This will work fine for a simple test of equality or inequality, since the numeric and string comparisons will give the same results, but it will give crazy results for a "less than" or "greater than" comparison.

 Awk is not broken; it is doing what it is supposed to do in this case. If this problem comes up, it is possible to add a second test to the comparison to determine if the field contains a numeric value or a text string. This second test has the form:
   (( $1 + 0 ) == $1 )

 If $1 contains a numeric value, the left-hand side of this expression will add 0 to it, and awk will perform a numeric comparison that will always be true.

 If $1 contains a text string that doesn't look like a number, for want of anything better to do awk will interpret its value as 0. This means the left-hand side of the expression will evaluate to zero; since there is a non-numeric text string in $1, awk will perform a string comparison that will always be false. This leads to a more workable comparison:
   ((( $1 + 0 ) == $1 ) && ( $1 > 100 ))
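
 As a complete one-liner, this keeps only the lines whose first field is genuinely numeric and greater than 100; a sketch:
   awk '((( $1 + 0 ) == $1 ) && ( $1 > 100 ))' myfile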


AWK Numerical Functions
Name    Function           
cos(x)     Cosine with x in radians       
exp(x)     Exponent       
int(x)     Integer part of x truncated towards 0        
log(x)     Logarithm  (natural logarithm of x )
sin(x)     Sine  with x in radians     
sqrt(x)    Square Root   
atan2(y,x)   Arctangent of y/x in radians       
rand()    Random number between 0 and 1
srand(x)   Seed Random  
      

awk 'BEGIN { for (i = 1; i <= 7; i++) print int(101 * rand()) }'
This program prints 7 random numbers from 0 to 100, inclusive.

awk '{print sqrt($1)}' myfile
Print the square root for numbers in field 1

rand()
This gives you a random number. The values of rand are uniformly-distributed between 0 and 1. The value is never 0 and never 1.  Often you want random integers instead. Here is a user-defined function you can use to obtain a random nonnegative integer less than n:
function randint(n) {
     return int(n * rand())
}

 The multiplication produces a random real number greater than 0 and less than n. We then make it an integer (using int) between 0 and n - 1.  Here is an example where a similar function is used to produce random integers between 1 and n. Note that this program will print a new random number for each input record.
awk '
# Function to roll a simulated die.
function roll(n) { return 1 + int(rand() * n) }

# Roll 3 six-sided dice and print total number of points.
{
      printf("%d points\n", roll(6)+roll(6)+roll(6))
}'

 Note: rand starts generating numbers from the same point, or seed, each time you run awk. This means that a program will produce the same results each time you run it. The numbers are random within one awk run, but predictable from run to run. This is convenient for debugging, but if you want a program to do different things each time it is used, you must change the seed to a value that will be different in each run. To do this, use srand.
 srand(x)
The function srand sets the starting point, or seed, for generating random numbers to the value x.  Each seed value leads to a particular sequence of "random" numbers. Thus, if you set the seed to the same value a second time, you will get the same sequence of "random" numbers again.  If you omit the argument x, as in srand(), then the current date and time of day are used for a seed. This is the way to get random numbers that are truly unpredictable.  The return value of srand is the previous seed. This makes it easy to keep track of the seeds for use in consistently reproducing sequences of random numbers.
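
A minimal sketch combining the two: seed from the clock, then draw one random integer between 0 and 100:
awk 'BEGIN { srand(); print int(101 * rand()) }'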

String Functions
     index(string,search)           
     length(string)               
     split(string,array,separator)   
     substr(string,position)          
     substr(string,position,max)      
     sub(regex,replacement)           
     sub(regex,replacement,string)    
     gsub(regex,replacement)           
     gsub(regex,replacement,string)   
     match(string,regex)           
     tolower(string)               
     toupper(string)
     system(cmd-line)
             Execute the command cmd-line, and return the exit status             
    
Example
The string function gsub replaces each occurrence of 286 with the string AT:
awk '{ gsub( /286/, "AT" ); print $0 }' myfile

awk '{print tolower($0)}' myfile

If myfile contains:
50.211 14.979 24.196
50.142 15.162 25.415

awk '{split($0,a," "); print a[1]}' myfile
will give :
50.211
50.142

If I do this only on line 1 (here the file is called 2points):
awk 'NR == 1 {split($0,a," "); print a[1]}' 2points
I get:
50.211

If myfile contains:
Processing NGC 2345

awk '{print substr($0,12,8)}' myfile
will give: NGC 2345

The "split()" function has the syntax:
   split(<string>,<array>,[<field separator>])

 This function takes a string with n fields and stores the fields into array[1], array[2], ... , array[n]. If the optional field separator is not specified, the value of FS (normally "white space", the space and tab characters) is used. For example, suppose we have a field of the form:
   joe:frank:harry:bill:bob:sil

 We could use "split()" to break it up and print the names as follows:
   my_string = "joe:frank:harry:bill:bob:sil";
   split(my_string,names,":");
   print names[1];
   print names[2];
   ...

 The "index()" function has the syntax:
   index(<target string>,<search string>)

 -- and returns the position at which the search string begins in the target string (remember, the initial position is "1"). For example:
   index("gorbachev","bach")         returns:  4
   index("superficial","super")      returns:  1
   index("sunfire","fireball")       returns:  0
   index("aardvark","z")             returns:  0

match(string, regexp)
The match function searches the string, string, for the longest, leftmost substring matched by the regular expression, regexp. It returns the character position, or index, of where that substring begins (1, if it starts at the beginning of string). If no match is found, it returns 0.  The match function sets the built-in variable RSTART to the index. It also sets the built-in variable RLENGTH to the length in characters of the matched substring. If no match is found, RSTART is set to 0, and RLENGTH to -1.  For example:
awk '{
       if ($1 == "FIND")
         regex = $2
       else {
         where = match($0, regex)
         if (where)
           print "Match of", regex, "found at", where, "in", $0
       }
}'

This program looks for lines that match the regular expression stored in the variable regex. This regular expression can be changed. If the first word on a line is FIND, regex is changed to be the second word on that line. Therefore, given:
FIND fo*bar
My program was a foobar
But none of it would doobar
FIND Melvin
JF+KM
This line is property of The Reality Engineering Co.
This file was created by Melvin.

awk prints:
Match of fo*bar found at 18 in My program was a foobar
Match of Melvin found at 26 in This file created by Melvin.

split(string, array, fieldsep)
This divides string into pieces separated by fieldsep, and stores the pieces in array. The first piece is stored in array[1], the second piece in array[2], and so forth. The string value of the third argument, fieldsep, is a regexp describing where to split string (much as FS can be a regexp describing where to split input records). If the fieldsep is omitted, the value of FS is used. split returns the number of elements created. The split function, then, splits strings into pieces in a manner similar to the way input lines are split into fields. For example:
split("auto-da-fe", a, "-")

splits the string `auto-da-fe' into three fields using `-' as the separator. It sets the contents of the array a as follows:
a[1] = "auto"
a[2] = "da"
a[3] = "fe"

The value returned by this call to split is 3.  As with input field-splitting, when the value of fieldsep is " ", leading and trailing whitespace is ignored, and the elements are separated by runs of whitespace.
 sprintf(format, expression1,...)
This returns (without printing) the string that printf would have printed out with the same arguments (see section Using printf Statements for Fancier Printing). For example:
sprintf("pi = %.2f (approx.)", 22/7)

 returns the string "pi = 3.14 (approx.)".
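
 A typical use of sprintf is to build a value before printing it; a sketch that prefixes each line with a zero-padded line number (the variable tag and the format %04d are just examples):
awk '{ tag = sprintf("%04d", NR); print tag, $0 }' myfile
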
 sub(regexp, replacement, target)
The sub function alters the value of target. It searches this value, which should be a string, for the leftmost substring matched by the regular expression, regexp, extending this match as far as possible. Then the entire string is changed by replacing the matched text with replacement. The modified string becomes the new value of target.  This function is peculiar because target is not simply used to compute a value, and not just any expression will do: it must be a variable, field or array reference, so that sub can store a modified value there. If this argument is omitted, then the default is to use and alter $0.  For example:
str = "water, water, everywhere"
sub(/at/, "ith", str)

sets str to "wither, water, everywhere", by replacing the leftmost, longest occurrence of 'at' with 'ith'.  The sub function returns the number of substitutions made (either one or zero).  If the special character '&' appears in replacement, it stands for the precise substring that was matched by regexp. (If the regexp can match more than one string, then this precise substring may vary.) For example:
awk '{ sub(/candidate/, "& and his wife"); print }'

changes the first occurrence of 'candidate' to 'candidate and his wife' on each input line.  Here is another example:
awk 'BEGIN {
        str = "daabaaa"
        sub(/a+/, "c&c", str)
        print str
}'

 prints 'dcaacbaaa'. This shows how '&' can represent a non-constant string, and also illustrates the "leftmost, longest" rule.  The effect of this special character ('&') can be turned off by putting a backslash before it in the string. As usual, to insert one backslash in the string, you must write two backslashes. Therefore, write '\\&' in a string constant to include a literal '&' in the replacement. For example, here is how to replace the first '|' on each line with an '&':
awk '{ sub(/\|/, "\\&"); print }'

Note: as mentioned above, the third argument to sub must be an lvalue. Some versions of awk allow the third argument to be an expression which is not an lvalue. In such a case, sub would still search for the pattern and return 0 or 1, but the result of the substitution (if any) would be thrown away because there is no place to put it. Such versions of awk accept expressions like this:
sub(/USA/, "United States", "the USA and Canada")

 But that is considered erroneous in gawk.
 gsub(regexp, replacement, target)
 This is similar to the sub function, except gsub replaces all of the longest, leftmost, nonoverlapping matching substrings it can find. The 'g' in gsub stands for "global," which means replace everywhere. For example:
awk '{ gsub(/Britain/, "United Kingdom"); print }'

 replaces all occurrences of the string 'Britain' with 'United Kingdom' for all input records. The gsub function returns the number of substitutions made. If the variable to be searched and altered, target, is omitted, then the entire input record, $0, is used. As in sub, the characters '&' and '\' are special, and the third argument must be an lvalue.
 substr(string, start, length)
 This returns a length-character-long substring of string, starting at character number start. The first character of a string is character number one. For example, substr("washington", 5, 3) returns "ing". If length is not present, this function returns the whole suffix of string that begins at character number start. For example, substr("washington", 5) returns "ington". This is also the case if length is greater than the number of characters remaining in the string, counting from character number start.
 tolower(string)
 This returns a copy of string, with each upper-case character in the string replaced with its corresponding lower-case character. Nonalphabetic characters are left unchanged. For example, tolower("MiXeD cAsE 123") returns "mixed case 123".
 toupper(string)
 This returns a copy of string, with each lower-case character in the string replaced with its corresponding upper-case character. Nonalphabetic characters are left unchanged. For example, toupper("MiXeD cAsE 123") returns "MIXED CASE 123".


The getline function
      getline                  
      getline <file             
      getline variable              
      getline variable <file    
awk provides the function getline to read input from the current input file or from a file or pipe.
getline reads the next input line, splits it into fields, and updates NF, NR and FNR. It returns 1 for success, 0 for end-of-file, and -1 on error.
The statement
        getline < "temp.dat"
reads the next input line from the file "temp.dat", field splitting is performed, and NF is set.

The statement
        getline data < "temp.dat"
reads the next input line from the file "temp.dat" into the user defined variable data, no field splitting is done, and NF, NR and FNR are not altered.
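
A minimal sketch of this form: paste two files side by side, reading one line of file2 for each line of file1 (file1 and file2 are placeholder names, assumed to have the same number of lines):
awk '{ getline other < "file2"; print $0, other }' file1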

You can take input from the keyboard while an awk script is running; try the following:
      awk 'BEGIN {print "your name"; getline na <"-"; print "my name is " na}'
Here getline reads a line from the keyboard and assigns it to the variable na.
Syntax:
      getline variable-name < "-"

where:
      getline          is the function name
      variable-name    is the variable used to assign the value read from input
      "-"              means read from stdin (the keyboard)


Function Definition Example
Here is an example of a user-defined function, called myprint, that takes a number and prints it in a specific format.
function myprint(num)
{
     printf "%6.3g\n", num
}


 To illustrate, here is an awk rule which uses our myprint function:
$3 > 0     { myprint($3) }

 This program prints, in our special format, all the third fields that contain a positive number in our input. Therefore, when given:
 1.2   3.4    5.6   7.8
 9.10 11.12 -13.14 15.16
17.18 19.20  21.22 23.24


 this program, using our function to format the results, prints:
   5.6
  21.2


 Here is an example of a recursive function. It prints a string backwards:
function rev (str, len) {
    if (len == 0) {
        printf "\n"
        return
    }
    printf "%c", substr(str, len, 1)
    rev(str, len - 1)
}
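
 To illustrate, here is an awk rule that uses rev to reverse every input line (the function definition above must appear in the same program):
{ rev($0, length($0)) }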

Calling User-defined Functions
Calling a function means causing the function to run and do its job. A function call is an expression, and its value is the value returned by the function.
A function call consists of the function name followed by the arguments in parentheses. What you write in the call for the arguments are awk expressions; each time the call is executed, these expressions are evaluated, and the values are the actual arguments. For example, here is a call to foo with three arguments (the first being a string concatenation):
foo(x y, "lose", 4 * z)

Caution: whitespace characters (spaces and tabs) are not allowed between the function name and the open-parenthesis of the argument list. If you write whitespace by mistake, awk might think that you mean to concatenate a variable with an expression in parentheses. However, it notices that you used a function name and not a variable name, and reports an error.

When a function is called, it is given a copy of the values of its arguments. This is called call by value. The caller may use a variable as the expression for the argument, but the called function does not know this: it only knows what value the argument had. For example, if you write this code:
foo = "bar"
z = myfunc(foo)

then you should not think of the argument to myfunc as being "the variable foo." Instead, think of the argument as the string value, "bar".

If the function myfunc alters the values of its local variables, this has no effect on any other variables. In particular, if myfunc does this:
function myfunc (win) {
  print win
  win = "zzz"
  print win
}

to change its first argument variable win, this does not change the value of foo in the caller. The role of foo in calling myfunc ended when its value, "bar", was computed. If win also exists outside of myfunc, the function body cannot alter this outer value, because it is shadowed during the execution of myfunc and cannot be seen or changed from there.

However, when arrays are the parameters to functions, they are not copied. Instead, the array itself is made available for direct manipulation by the function. This is usually called call by reference. Changes made to an array parameter inside the body of a function are visible outside that function.  This can be very dangerous if you do not watch what you are doing. For example:
function changeit (array, ind, nvalue) {
     array[ind] = nvalue
}

BEGIN {
           a[1] = 1 ; a[2] = 2 ; a[3] = 3
           changeit(a, 2, "two")
           printf "a[1] = %s, a[2] = %s, a[3] = %s\n", a[1], a[2], a[3]
      }


 prints 'a[1] = 1, a[2] = two, a[3] = 3', because calling changeit stores "two" in the second element of a.


The return Statement
The body of a user-defined function can contain a return statement. This statement returns control to the rest of the awk program. It can also be used to return a value for use in the rest of the awk program. It looks like this:
return expression

 The expression part is optional. If it is omitted, then the returned value is undefined and, therefore, unpredictable.

 A return statement with no value expression is assumed at the end of every function definition. So if control reaches the end of the function body, then the function returns an unpredictable value.  awk will not warn you if you use the return value of such a function; you will simply get unpredictable or unexpected results.

Here is an example of a user-defined function that returns a value for the largest number among the elements of an array:
function maxelt (vec,   i, ret) {
     for (i in vec) {
          if (ret == "" || vec[i] > ret)
               ret = vec[i]
     }
     return ret
}


You call maxelt with one argument, which is an array name. The local variables i and ret are not intended to be arguments; while there is nothing to stop you from passing two or three arguments to maxelt, the results would be strange. The extra space before i in the function parameter list is to indicate that i and ret are not supposed to be arguments. This is a convention which you should follow when you define functions.

Here is a program that uses our maxelt function. It loads an array, calls maxelt, and then reports the maximum number in that array:
awk '
function maxelt (vec,   i, ret) {
     for (i in vec) {
          if (ret == "" || vec[i] > ret)
               ret = vec[i]
     }
     return ret
}

# Load all fields of each record into nums.
{
          for(i = 1; i <= NF; i++)
               nums[NR, i] = $i
}

END {
     print maxelt(nums)
}'


 Given the following input:
 1 5 23 8 16
44 3 5 2 8 26
256 291 1396 2962 100
-6 467 998 1101
99385 11 0 225

the program tells us that:
99385
is the largest number in our array.

awk Control Flow Statements
if (  expression )  statement1 else  statement2

while (  expression )   statement

for (  expression1;  expression;  expression2 )  statement


 The syntax of "if ... else" is:
   if (<condition>) <action 1> [else <action 2>]

 The "else" clause is optional. The "condition" can be any expression discussed in the section on pattern matching, including matches with regular expressions.
For example, consider the following Awk program:
   {if ($1=="green") print "GO";
    else if ($1=="yellow") print "SLOW DOWN";
    else if ($1=="red") print "STOP";
    else print "WHAT";}


The syntax for "while" is:
   while (<condition>) <action>

 The "action" is performed as long as the "condition" tests true, and the "condition" is tested before each iteration. The conditions are the same as for the "if ... else" construct. For example, since by default an Awk variable has a value of 0, the following Awk program prints the numbers from 1 to 20:
   BEGIN {while(++x<=20) print x}

 * The "for" loop is more flexible. It has the syntax:
   for (<initial action>;<condition>;<end-of-loop action>) <action>

 For example, the following "for" loop prints the numbers 10 through 20 in increments of 2:
   BEGIN {for (i=10; i<=20; i+=2) print i}

 This is equivalent to:
   i=10
   while (i<=20) {
      print i;
      i+=2;}

The "for" loop has an alternate syntax, used when scanning through an array:
   for (<variable> in <array>) <action>

 with the example:
   my_string = "joe:frank:harry:bill:bob:sil";
   split(my_string, names, ":");

 -- then the names could be printed with the following statement:
   for (idx in names) print idx, names[idx];

 This yields:
   2 frank
   3 harry
   4 bill
   5 bob
   6 sil
   1 joe

 Notice that the names are not printed in the proper order. One of the characteristics of this type of "for" loop is that the array is not scanned in a predictable order.
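
 If the order matters, iterate over the numeric indices instead, using the count returned by "split()"; a sketch:
   n = split(my_string, names, ":");
   for (idx = 1; idx <= n; idx++) print idx, names[idx];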


Awk defines four unconditional control statements: "break", "continue", "next", and "exit". "Break" and "continue" are strictly associated with the "while" and "for" loops:

    •      break: Causes a jump out of the loop.

    •      continue: Forces the next iteration of the loop.

 "Next" and "exit" control Awk's input scanning:

    •      next: Causes Awk to immediately get another line of input and begin scanning it from the first match statement.

    •      exit: Causes Awk to end reading its input and execute END operations,  if any are specified.
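
 For example, a sketch that uses "next" to skip comment lines and print everything else:
   awk '/^#/ {next} {print}' myfile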


Limits
Each implementation of awk imposes some limits. Below are typical limits
        100 fields
        2500 characters per input line
        2500 characters per output line
        1024 characters per individual field
        1024 characters per printf string
        400 characters maximum quoted string
        400 characters in character class
        15 open files
        1 pipe


EXAMPLES
There are millions of different ways to do things... here are a few examples


The simplest action is to print some or all of a record; this is accomplished by the awk command print.
The awk program

awk '{ print }' myfile
Prints each record
while
awk '{print $2, $1}' myfile
prints the first two fields in reverse order, but
awk '{print $1 $2}' myfile
concatenates the two fields with no separator

awk '{ print $1 >"foo1"; print $2 >"foo2" }' myfile
writes field 1 into file foo1 and field 2 into file foo2


The  variables OFS and ORS may be used to change the current output field separator and output record separator.   The  output record  separator  is  appended to the output of the print statement.
Awk also provides the printf statement  for  output  formatting.

BEGIN and END
The special pattern  BEGIN  matches  the  beginning  of  the input,  before the first record is read.  The pattern END matches the end of the input, after the last record has  been  processed. BEGIN and END thus provide a way to gain control before and after processing.


-------

awk '{ if ($2 =="0.5") {print $0} }' myfile
prints the lines for which field 2 = 0.5

Counting things:
awk 'BEGIN {counter = 0} {if ($2 == "0.5"){counter++}} END {print counter} '  myfile
this tells me how many times field 2 has a value of 0.5
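
Since awk variables start at 0, the BEGIN block is optional; the same count in a shorter form (the +0 forces a 0 to be printed when there are no matches):
awk '$2 == "0.5" {counter++} END {print counter+0}' myfile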



Using Awk to create a simple histogram
We have scores in a file called mydata
r   0.2     99
r   0.1     88
r   0.4     76
r   0.1     76
r   0.2     56
r   0.3     900
r   0.2     43
r   0.5     5
r   0.5     9
r   0.6     56
r   0.8     43
r   0.7     33
r   0.9     10



we can sort on the second column with:

sort -n -k2,2 mydata > mydata_sorted      (older sorts use: sort +1 -n mydata)

this gives:
r   0.1     76
r   0.1     88
r   0.2     43
r   0.2     56
r   0.2     99
r   0.3     900
r   0.4     76
r   0.5     5
r   0.5     9
r   0.6     56
r   0.7     33
r   0.8     43
r   0.9     10


(sorting can be done in descending (reverse) order with sort -nr)


You can put the following lines in a file called histo.txt
to print a frequency histogram of the numbers in column 2 (field 2)
$2 <= 0.1 {na=na+1}
($2 > 0.1) && ($2 <= 0.2) {nb = nb+1}
($2 > 0.2) && ($2 <= 0.3) {nc = nc+1}
($2 > 0.3) && ($2 <= 0.4) {nd = nd+1}
($2 > 0.4) && ($2 <= 0.5) {ne = ne+1}
($2 > 0.5) && ($2 <= 0.6) {nf = nf+1}
($2 > 0.6) && ($2 <= 0.7) {ng = ng+1}
($2 > 0.7) && ($2 <= 0.8) {nh = nh+1}
($2 > 0.8) && ($2 <= 0.9) {ni = ni+1}
($2 > 0.9) {nj = nj+1}
END {print na, nb, nc, nd, ne, nf, ng, nh, ni, nj, NR}

and run

awk -f histo.txt mydata_sorted
this will give:
2 3 1 1 2 1 1 1 1  13
meaning two values are <= 0.1, three fall in the (0.1, 0.2] bin, and so on; the last number (13) is the total line count (NR)


COUNTING score values after docking

awk 'BEGIN{
counter = 0
}
{
if ($3 >= 0 && $3 <= 1 ){
counter++
}
}
END{
print "Scores_0_to_1    " counter
}' mylistwithnameshort_test.txt

awk 'BEGIN{
counter = 0
}
{
if ($3 >= 1 && $3 <= 2 ){
counter++
}
}
END{
print "Scores_1_to_2    " counter
}' mylistwithnameshort_test.txt

awk 'BEGIN{
counter = 0
}
{
if ($3 >= 2 && $3 <= 3 ){
counter++
}
}
END{
print "Scores_2_to_3    " counter
}' mylistwithnameshort_test.txt

awk 'BEGIN{
counter = 0
}
{
if ($3 >= 3 && $3 <= 4 ){
counter++
}
}
END{
print "Scores_3_to_4    " counter
}' mylistwithnameshort_test.txt

awk 'BEGIN{
counter = 0
}
{
if ($3 >= 4 && $3 <= 5 ){
counter++
}
}
END{
print "Scores_4_to_5    " counter
}' mylistwithnameshort_test.txt


awk 'BEGIN{
counter = 0
}
{
if ($3 >= 5 && $3 <= 6 ){
counter++
}
}
END{
print "Scores_5_to_6    " counter
}' mylistwithnameshort_test.txt

awk 'BEGIN{
counter = 0
}
{
if ($3 >= 6 && $3 <= 7 ){
counter++
}
}
END{
print "Scores_6_to_7    " counter
}' mylistwithnameshort_test.txt

awk 'BEGIN{
counter = 0
}
{
if ($3 >= 7 && $3 <= 8 ){
counter++
}
}
END{
print "Scores_7_to_8    " counter
}' mylistwithnameshort_test.txt

awk 'BEGIN{
counter = 0
}
{
if ($3 >= 8 && $3 <= 9 ){
counter++
}
}
END{
print "Scores_8_to_9    " counter
}' mylistwithnameshort_test.txt

awk 'BEGIN{
counter = 0
}
{
if ($3 >= 9 && $3 <= 10 ){
counter++
}
}
END{
print "Scores_9_to_10   " counter
}' mylistwithnameshort_test.txt

awk 'BEGIN{
counter = 0
}
{
if ($3 >= 10 && $3 <= 11 ){
counter++
}
}
END{
print "Scores_10_to_11  " counter
}' mylistwithnameshort_test.txt

awk 'BEGIN{
counter = 0
}
{
if ($3 >= 11 && $3 <= 12 ){
counter++
}
}
END{
print "Scores_11_to_12  " counter
}' mylistwithnameshort_test.txt

awk 'BEGIN{
counter = 0
}
{
if ($3 >= 12 && $3 <= 13 ){
counter++
}
}
END{
print "Scores_12_to_13  " counter
}' mylistwithnameshort_test.txt

awk 'BEGIN{
counter = 0
}
{
if ($3 >= 13 && $3 <= 14 ){
counter++
}
}
END{
print "Scores_13_to_14  "     counter
}' mylistwithnameshort_test.txt




awk 'BEGIN{
counter = 0
}
{
if ($3 >= 14 && $3 <= 15 ){
counter++
}
}
END{
print "Scores_14_to_15  "     counter
}' mylistwithnameshort_test.txt



awk 'BEGIN{
counter = 0
}
{
if ($3 >= 15 && $3 <= 20 ){
counter++
}
}
END{
print "Scores_15_to_20  "     counter
}' mylistwithnameshort_test.txt
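
All of these passes can be collapsed into one awk run. Note also that the bin boundaries above overlap (a score of exactly 2 is counted in both Scores_1_to_2 and Scores_2_to_3). A sketch using half-open bins [n, n+1), assuming non-negative scores in field 3:

awk '{ count[int($3)]++ }
END {
    for (b = 0; b <= 19; b++)
        print "Scores_" b "_to_" (b+1) "   " count[b]+0
}' mylistwithnameshort_test.txt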


-------


echo "the script starts"
echo "check that each compound starts with ISIS"
echo -n "select the SDF file:"
read file1
echo "the name of the file is $file1"
tr '\r' '\n' < "$file1" > "tmp1_unix.sdf"
echo "step 1"
awk 'NF > 0' < "tmp1_unix.sdf" > "tmp2_unix_no_emptylines.sdf"
echo "this is done"

 

Word frequency
Print one word per line
xargs -n 1 < toto.txt
or
awk 'BEGIN{RS=" "} 1' myfile
or
awk -v OFS='\n' '{$1=$1}1' myfile

Then, something like this:
cat myfile | xargs -n 1 | tr '[:upper:]' '[:lower:]' | sort | uniq -c | sort -n
or
cat myfile | xargs -n 1 | tr '[A-Z]' '[a-z]' | sort | uniq -c | sort -n
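
The same word frequency count can be done entirely in awk; a sketch (case-folded, whitespace-separated words):
awk '{ for (i = 1; i <= NF; i++) count[tolower($i)]++ }
     END { for (w in count) print count[w], w }' myfile | sort -n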


Rename files
#!/bin/sh
# we have less than 3 arguments. Print the help text:
if [ $# -lt 3 ] ; then
cat <<HELP
ren -- renames a number of files using sed regular expressions

USAGE: ren 'regexp' 'replacement' files...

EXAMPLE: rename all *.HTM files to *.html:
  ren 'HTM$' 'html' *.HTM

HELP
  exit 0
fi
OLD="$1"
NEW="$2"
# The shift command removes one argument from the list of
# command line arguments.
shift
shift
# $* contains now all the files:
for file in $*; do
    if [ -f "$file" ] ; then
      newfile=`echo "$file" | sed "s/${OLD}/${NEW}/g"`
      if [ -f "$newfile" ]; then
        echo "ERROR: $newfile exists already"
      else
        echo "renaming $file to $newfile ..."
        mv "$file" "$newfile"
      fi
    fi
done


Rename files
myfiles=`ls toto*`
for i in $myfiles ; do

echo "mv $i $i.smi"
done
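
This only prints the mv commands (pipe the output to sh to execute them). A more direct sketch of the same renaming:
for i in toto* ; do
    mv "$i" "$i.smi"
done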

Some other examples
# Print first two fields in opposite order:
  awk '{ print $2, $1 }' file

# Print lines longer than 72 characters:
  awk 'length > 72' file

# Print length of string in 2nd column
  awk '{print length($2)}' file

# Add up first column, print sum and average:
       { s += $1 }
  END  { print "sum is", s, " average is", s/NR }

# Print fields in reverse order:
  awk '{ for (i = NF; i > 0; --i) print $i }' file

# Print the last line
      {line = $0}
  END {print line}

# Print the total number of lines that contain the word Pat
  /Pat/ {nlines = nlines + 1}
  END {print nlines}

# Print all lines between start/stop pairs:
  awk '/start/, /stop/' file

# Print all lines whose first field is different from previous one:
  awk '$1 != prev { print; prev = $1 }' file

# Print column 3 if column 1 > column 2:
  awk '$1 > $2 {print $3}' file

# Print line if column 3 > column 2:
  awk '$3 > $2' file

# Count number of lines where col 3 > col 1
  awk '$3 > $1 {print ++i}' file

# Print sequence number and then column 1 of file:
  awk '{print NR, $1}' file

# Print every line after erasing the 2nd field
  awk '{$2 = ""; print}' file

# Print hi 28 times
  yes | head -28 | awk '{ print "hi" }'

# Print hi0010 to hi0099 (NOTE IRAF USERS!)
  yes | head -90 | awk '{printf("hi00%2.0f \n", NR+9)}'

# Print out 4 random numbers between 0 and 1
yes | head -4 | awk '{print rand()}'

# Print out 40 random integers modulo 5
yes | head -40 | awk '{print int(100*rand()) % 5}'


# Replace every field by its absolute value
  { for (i = 1; i <= NF; i++) if ($i < 0) $i = -$i; print }

# If you have another character that delimits fields, use the -F option
# For example, to print out the phone number for Jones in the following file,
# 000902|Beavis|Theodore|333-242-2222|149092
# 000901|Jones|Bill|532-382-0342|234023
# ...
# type
  awk -F"|" '$2=="Jones"{print $4}' filename

# Some looping commands
# Remove a bunch of print jobs from the queue
  BEGIN {
    for (i=875; i>833; i--) {
        printf "lprm -Plw %d\n", i
    }
    exit
  }

 Formatted printouts are of the form printf("format\n", value1, value2, ... valueN)
        e.g. printf("howdy %-8s What it is bro. %.2f\n", $1, $2*$3)
    %s = string
    %-8s = 8 character string left justified
     %.2f = number with 2 places after .
    %6.2f = field 6 chars with 2 chars after .
    \n is newline
    \t is a tab



