Unix tips


Some Unix tips to manipulate biology and chemistry files

The idea is to have very short & simple scripts or one-liners
to perform simple tasks on chemistry and protein (or other) files

Many other tools can also help, for instance:
Pandas and biopandas help to manipulate data files (Python)
Also very valuable: Jupyter Notebook, to create and share documents that contain live code, equations, visualizations and narrative text
Machine learning in Python: scikit-learn
And a nice overview: Python Data Science Handbook
Prepare or manipulate PDB files with Python (eg, scripts to download files, etc, without dependencies), for instance pdb-tools
PubChemPy provides a way to interact with PubChem in Python

Preparing files
Line endings differ between OSes (eg invisible characters: Control-M carriage returns from Windows files, etc)

seen with:

cat -v filename

cat -t filename
(print special characters)


tr '\r' '\n' < macfile.txt > unixfile.txt
or
awk '{ gsub("\r", "\n"); print $0;}' macfile.txt > unixfile.txt

To convert a Unix file to classic Mac OS line endings (CR only) using awk, at the command line, enter:
awk '{ printf "%s\r", $0 }' unixfile.txt > macfile.txt

Delete BOTH leading and trailing whitespace from each line
sed 's/^[[:blank:]]*//;s/[[:blank:]]*$//' inputfile

Remove Control-M characters (ie convert DOS newlines to Unix format) in place, for example in all xml files in a directory (warning: this changes the files in place, duplicate them first if the originals are needed):
perl -pi -e 's/\r/\n/g' *.xml (may add double lines)

perl -p -i -e "s/\r\n/\n/g" *.xml (should not add empty lines)

Find non-ASCII characters in a file
perl -lne 'print if /[^[:ascii:]]/' file.txt
or
perl -ne 'print "$. $_" if m/[\x80-\xFF]/' file.txt


Simple sed for html
sed -e 's/target="_blank">http.*</target="_blank">LINK<\/a></g' my.html > my2.html
This runs with sed on macOS; beware that sed syntax differs between OSes

Replace, for instance in a csv file, a ";" by a tab. Example with the FAF-Drugs filtering results.csv file
awk '{gsub(";","\t",$0); print;}' myfile.csv

Sort things out
If in the file we have compound names and energy scores
Toto,12
Toto2,17
Toto3,5

To sort numerically on the field after the "," in reverse order
sort -t',' -k2 -r -n
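For example, if the three lines above are in a file called scores.csv (a hypothetical name), sorting numerically on the second comma-separated field in reverse order gives:

sort -t',' -k2 -r -n scores.csv
Toto2,17
Toto,12
Toto3,5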

Copy a file into several directories
for dir in *.dir; do cp myfile.pdb "$dir" ; done

Extract exact PDB file names from a text
cat myfilewithPDB_name_everywhere.txt | tr ' ' '\n' | grep '\.pdb' | tr '[:upper:]' '[:lower:]' | sort | uniq -c
each space becomes a new line, then grep the .pdb extension, put everything in lower case, sort, and count the duplicates

Renumber PDB file - here add 100 to the residue numbers present in field 6 (warning: the header and heteroatom (HETATM) records may be lost, some fields may not be separated by white space in the input file, ANISOU records should be removed, etc)

cat myfile.pdb | awk -v x=100 '{printf "%4s%7.0f%3s%6s%2s%4.0f%12.3f%8.3f%8.3f%6.2f%7.2f\n", $1, $2, $3, $4, $5, ($6 + x), $7, $8, $9, $10, $11}' >> renumbered.pdb

Extract emails
cat myfile.txt | tr ' ' '\n' | grep '@' | tr '[:upper:]' '[:lower:]' | sort | uniq -c
grep -o '[[:alnum:]+\.\_\-]*@[[:alnum:]+\.\_\-]*' myfile.txt

In a script file:

if [ -f "$1" ]; then

    grep -o '[[:alnum:]+\.\_\-]*@[[:alnum:]+\.\_\-]*' "$1" | sort | uniq -i

  else

    echo "Expected a file at $1, but it doesn't exist." >&2

    exit 1

  fi

Grep and count URLs

cat file-html.txt | grep -Eo "(http|https)://[a-zA-Z0-9./?=_-]*" | sort | uniq | wc -l

or

cat file-html.txt | grep -Eo "(http|https)://[a-zA-Z0-9./?=_-]*" | sort -u | wc -l


Check compound ID in field 1 already present in the file
Here in file1, AA_20 and AA_21 are the same compound but different poses (the goal is to get the unique compound IDs):
AA_20
BB_30
AA_21
CC_300
AB_30

awk -F'_' '!seen[$1]++' file1
will print this:
AA_20
BB_30
CC_300
AB_30

Note:
Here awk uses an associative array to remove duplicates. When a pattern ($1) appears for the first time its count is incremented, but since ++ is a post-increment the expression evaluates to 0, and the negation of 0 is true, so the line is printed. When the same pattern appears again the expression evaluates to 1 (or more), the negation is false, and the line is not printed.

Find the difference between two files (lines of file2 that match none of the patterns in file1)

grep -v -f file1 file2
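A minimal illustration with hypothetical files: if file1 contains the patterns AA and BB and file2 contains the lines AA, BB and CC, then grep -v -f file1 file2 prints the lines of file2 that match none of the patterns in file1:

grep -v -f file1 file2
CC

(remember grep -f treats each line of file1 as a regular expression; add -F for fixed strings and -x for whole-line matches)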

Remove duplicate compounds with Babel
babel infile.xxx outfile.yyy --unique cansmiNS
babel infile.xxx outfile.yyy --unique cansmi
babel infile.xxx outfile.yyy --unique /nostereo/nochg

if several files in a directory, possible also to do something like
babel *.smi outputfile.smi --unique /nostereo/nochg

SMI and SDF files with Babel and unix
If we have some SDF file and we want to fetch the line after a string, for example a field containing "Item" in the SDF file

I can do:

grep 'Item' -A1 myfile.sdf | grep -v 'Item' | grep -v '\-\-' > output

In output I have for instance the name of the compounds

Then with Babel i can get the SMI file for each molecule

I combine the two files with: paste

If I want to add some info after the name of the molecules, also obtained with grep, I can for instance add TARGET in front of each line with:
awk '{print "--TARGET--" $0;}' myfile-with-the-name-of-the-target-for-each-compound > output

Then i can paste without space my files with the ID number of the compounds and the targets for each compound, something like:
paste -d '\0' compound-names-file target-names > output

In the SMI file, I can check whether there are salts or mixtures by searching for "."

something like:
grep '\.' myfile.smi | wc -l
grep '[.]' myfile.smi
or
grep -F . myfile.smi (the F is for fixed string)

In Babel i can remove salts or smaller molecules in mixtures with something like:
babel -isdf input -r -e -osmi outputfile

-r keeps only the largest fragment of a mixture (strips salts), and -e skips problematic molecules and continues with the next one instead of crashing when the package cannot deal with them


Remove the first line of a file
awk 'NR>1' myfile.txt > output.txt

Sort on the 3rd field in reverse order
sort -r -k 3,3 input > output

Keep unique lines without sorting in field 2
awk '!x[$2]++' input > output

Keep top 500 lines
head -500 infile > output
or
awk 'FNR <= 500' infile > output

Find the difference between two sorted files
diff file1 file2 | awk '/^>/{print $2 }'
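A small hypothetical example: if file1 contains the sorted lines a, b, c and file2 contains a, b, d, diff marks the lines present only in file2 with ">", so the awk above prints:

diff file1 file2 | awk '/^>/{print $2 }'
d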

MEAN, MEDIAN, MODE ...some codes from the forums....
Simple statistics - for much more, R is one solution
The mode of a list of numbers is the value (or values) that occurs most frequently
The median is the 'middle value' of the sorted list (for example, in the list 2 2 5 7 9 the mode is 2 and the median is 5)

To round up using awk:
echo "767.992" | awk '{printf("%5.0f\n",$0)}'

If needed
awk prints only the first occurrence of each record, thus removing duplicates from the data file without sorting
awk '!x[$0]++' myfile | sed '/^\s*$/d'
sed '/^\s*$/d' to remove single blank lines after duplicates are removed
or other ways to remove blank line (maybe better than sed)
awk '!/^$/' myfile
or
awk 'NF > 0' filename
or
awk NF filename

if data separated with a "|"
awk -F\| '{printf("%5.0f\n",$0)}'

MEAN (for field 2)
cat myfile_with_separators_being_a_|_pipe | awk -F\| '{print $2}' | awk '{sum+=$1} END {print sum/NR}'

MEAN and MEDIAN (here also, data separated by a "|" pipe and done for field 2, thus $2)
cat myfile | awk -F\| '{print $2}' | awk '{sum+=$1;a[x++]=$1;b[$1]++} END {print "Mean: " sum/x "\nMedian: "a[int((x+1)/2)] }'

Standard Deviation (field 2...)
cat myfile | awk -F\| '{print $2}' | awk '{sum+=$1; sumsq+=$1*$1;} END {print "stdev = " sqrt(sumsq/NR - (sum/NR)**2);}'

MODE (warning: if no most frequent value is found the output file may be empty; example for field 2 with many numbers, rounded up before the processing starts)
This could work for the entire file; then $0 should be used instead of $1 and the round-up step removed... depending on your data
cat myfile | awk -F\| '{printf("%5.0f\n",$2)}' | awk '{++a[$1]}END{for(i in a)if(a[i]>max){max=a[i];k=i} print k}' > tmp_file
IF nothing is found, the output file is empty

Test if empty file, something like this ..many ways

awk NF try_empty.txt | awk 'END{print (NR>0)?"NOT-EMPTY":"EMPTY"}'
if the file try_empty contains only carriage returns / blank lines, the output will be: EMPTY, as it should

MODE
If I want only numbers before the "." (6.5, 8.5....)
awk -F\. '{for (i=1; i<=NF; ++i) if (max <= ++x[$i]) max = x[$i]} END { for (i in x) if (x[i] == max) print i }' mynumber.txt

MIN, MAX
seq 2 10 | awk 'NR==1 {min=$1} NR>1 && $1<min { min=$1 } END {print min}'
seq 2 10 | awk 'NR==1 {max=$1} NR>1 && $1>max { max=$1 } END {print max}'

seq 2 10 | awk '{if(min==""){min=max=$1}; if($1>max) {max=$1}; if($1<min) {min=$1}; total+=$1; count+=1} END {print total/count, max, min}'

seq 10 | awk '{sum+=$1} END {print sum/NR}'
seq 10 | awk '{sum+=$1} END {print sum}'

MEAN, MEDIAN... again
in my file numbers.txt I have:
2.5
3.4
4
5.5
5.6
5
6
7

In a file named awk_mean_median_mode.txt i have:
(NB: If the file has a header on the first line, add NR > 1 in front of {col=$1}. The asort function used below requires gawk.)

{col=$1}{if((col ~ /^-?[0-9]*([.][0-9]+)?$/) && ($0!=""))
{
sum+=col;
a[x++]=col;
b[col]++
if(b[col]>hf){hf=b[col]}
}
}
END{n = asort(a);idx=int((x+1)/2)
print "Mean: " sum/x
print "Median: " ((idx==(x+1)/2) ? a[idx] : (a[idx]+a[idx+1])/2)
for (i in b){if(b[i]==hf){(k=="") ? (k=i):(k=k FS i)}{FS=","}}
print "Mode: " k
}


If i run:
awk -F'.' -f awk_mean_median_mode.txt numbers.txt
I have:
Mean: 4.28571
Median: 5
Mode: 5

If the number file contains:
2.5
3.4
4
5
6
7


Then i would get:
Mean: 4
Median: 4
Mode: 2,3,4,5,6

There is no most frequent number, thus mode prints all

Min max normalization

# Write in a script, chmod +x and run: ./myscript
# if one wants to do a min max normalization on the second field with some sorting. The input file is called number.txt
# the output file ready is final_surflex_normalize_inverse_ascending_order2.txt
# this could be used as input to compute ROC curve with scikit learn
# A potential problem is to assign a=10000000
# thus another option to get the min: awk 'min=="" || $2 < min {min=$2}; END{ print min}'
# a header is added with a tab, could be just a space if needed

#min=`awk 'BEGIN{a=10000000}{if ($2<a) a=$2} END{print a}' number.txt`

min=`awk 'min=="" || $2 < min {min=$2}; END{ print min}' number.txt`

max=`awk 'BEGIN{a=0}{if ($2>a) a=$2} END{print a}' number.txt`

awk -v varmin="$min" -v varmax="$max" '{print $1 , ($2 - varmin)/(varmax - varmin)}' number.txt | awk '{print $1, (1 - $2)}' | sort -k2,2 | awk 'BEGIN {print "ID\tscore"} {print}' > final_surflex_normalize_inverse_ascending_order2.txt

echo "the normalized file for python ROC is: final_surflex_normalize_inverse_ascending_order2.txt"

The input file number.txt:
bestenergy 23
secondbest 12
try 0
troisiemescore 3
quatriemescore 2
cinquieme_score 1

--------------------------------

Z-score

The formula for calculating a Z-score is z=(x-mu)/sigma, where mu is the population mean and sigma is the population standard deviation (it should normally be computed over many values; the handful of values below is only an example)

if scores in field 1 of file energy-score.txt, assign N = 1...

MEAN
awk -v N=1 '{ sum += $N } END { if (NR > 0) print sum / NR }' energy-score.txt

input: energy-score.txt
163 b1
120 b2
130 b3
108 b4
109 b5

output
126

format the output if needed
awk -v N=1 '{ sum += $N } END { if (NR > 0) printf("%.2f\n", sum/NR) }' energy-score.txt

output
126.00


Sigma (tells how the scores are spread out from the average)
Standard deviation field 1
awk '{x+=$0;y+=$0^2}END{print sqrt(y/NR-(x/NR)^2)}' energy-score.txt

output
20.1693

awk -v m=126 -v s=20.1693 '{print ($1-m)/s}' energy-score.txt

output
1.83447
-0.297482
0.198321
-0.892445
-0.842865

or in one line:
Test1
mean=$(awk -v N=1 '{ sum += $N } END { if (NR > 0) print sum / NR }' energy-score.txt) ; echo $mean
sigma=$(awk '{x+=$0;y+=$0^2}END{print sqrt(y/NR-(x/NR)^2)}' energy-score.txt) ; echo $sigma

Simple one liner
mean=$(awk -v N=1 '{ sum += $N } END { if (NR > 0) print sum / NR }' energy-score.txt) ; sigma=$(awk '{x+=$0;y+=$0^2}END{print sqrt(y/NR-(x/NR)^2)}' energy-score.txt) ; awk -v m="$mean" -v s="$sigma" '{print ($1-m)/s}' energy-score.txt

--------------------------------

Very simple text-mining unix

input file titi.txt

less -N titi.txt

1 jjkkkl Yes
2 kkll No
3 12 Oups
4 kk123lll, horse
5
6 totooo; NONO
7
8 toppop
9 top
10 D
11 Dkoopt


sed -e '10,11d'
delete lines 10 and 11

sed -e '11d'
delete line 11

cat titi.txt | wc | awk '{print "Lines: "$1"\tWords: "$2"\tCharacters: "$3}'
count all lines even if empty

cat titi.txt | tr 'A-Z' 'a-z' | tr -d '[:punct:]' | tr -d '[:digit:]' > titi_clean.txt
Output
jjkkkl yes
kkll no
oups
kklll horse

totooo nono

toppop
top
d
dkoopt

tokenisation: the -s (squeeze repeats) and -c (complement) options of tr transform words into lines
cat titi_clean.txt | tr -sc 'a-z' '\12' | sort | uniq -c | sort -n
cat titi_clean.txt | tr -sc 'a-z' '\12' > titi_tokenized.txt

remove stopword
awk 'FNR==NR{for(i=1;i<=NF;i++)w[$i];next}(!($1 in w))' stop_word.txt titi_tokenized.txt

you need a file with the stop words (here stop_word.txt), one per line

--------------------------------

Read the ID of compound in a file and fetch them in another file with grep

grep -f molport-name.txt smile_molport.txt
#-f option to find all matches in one call

grep -v -f molport-name.txt smile_molport.txt

molport-name.txt
Molport1
Molport2
Molport3

smile_molport.txt
CCCCCC Molport1
CCOOOOOO Molport2
CCC Molport3
OOOOOOO Molport4

Removing last two chars from each line in a text file
awk '{sub(/..$/,"")}1' file

Read the compound IDs from a file (converting carriage returns to newlines on the fly), then grep each ID in the SMILES database file and append the matches to the output
for i in $(tr '\r' '\n' < ID-name.txt); do grep $i smile_database.txt >> matched_output.txt ; done

On OS X, maybe you need gnu grep:
install with
brew install grep
(default install /usr/local/Cellar/grep)
run with ggrep

Go to a sub-directory and do something
Here go into every .dir directory and run babel to convert a smi file into an sdf file

for d in *.dir
do
( cd $d && j=`echo $d | sed "s/.dir//"` && babel -ismi *.smi -osdf $j.sdf )
done

Run Babel on several files
for f in *.smi ; do babel -ismi $f -osdf ${f}_results.sdf ; done

Transform but keep the filename without the extension .sdf
for f in *.sdf ; do babel -isdf $f -osmi ${f%.*}.smi ; done

for f in *.sdf ; do babel -isdf $f -osmi ${f%.*}.smi ; done; cat *.smi > TOTAL_files_smiles.txt

Print the full name of each file
list=`ls *.smi` ; echo "$list"
list=$(ls *.smi) ; echo "$list"

Add a prefix, or keep the part of the name before the "." and append _en
for filename in *.jpg; do mv "$filename" "prefix_$filename"; done
for filename in *.wav; do mv $filename ${filename%.*}_en.wav; done

--------------------------------

Split files
Warning: check that the line endings are Linux/Unix-type

Split with Babel type:
babel big_database_docked.sdf out.sdf -m

This will produce one sdf file per molecule in the input sdf database, named out1.sdf, out2.sdf ...


Split a file when finding a pattern, eg, pattern START
awk '/START/ { i++ } { print > "temp_txt" i }' myfile
will create output files temp_txt1, temp_txt2, ...


split -l 200000 mybigfile.smi output_file
generates files of 200k lines each (-l sets the number of lines per output file)
output_fileaa
output_fileab
output_fileac
output_filead...


Split large file and output many files with "about" 5 lines
awk -vc=1 'NR%5==0{++c}{print $0 > c".txt"}' Datafile.txt
AND THEN you can rename:
for filename in *.txt; do mv "$filename" "Prefix_$filename"; done;

If there are too many files in the directory and rm gives an annoying warning such as: argument list too long,
then one can try:
find all the .smi files and pass them to rm
find . -name "*.smi" | xargs rm

Split file
awk 'NR%2==0' FILE > ALL_THE_EVEN_LINES
awk 'NR%2==1' FILE > ALL_THE_ODD_LINES

Split file for sentiment analysis

split a text file into one-line files and give the outputs names ending in .txt; of course it depends on how the sentences end..

If we need to create a new sentence after a dot, something like this

awk -F. '{ for (i=1;i<=NF;i++) printf "%s.\n",$i ;}' myfile

split -l 1 myfile.txt smallfile

Find empty file
Here the size is not zero but just 1 or 2 bytes

find . -size 1
(-size 1 means one 512-byte block, i.e. files from 1 to 512 bytes; use -size 1c for files of exactly 1 byte)

Or put the output of ls -lrt in a file: ls -lrt > myinput.txt

then select the file names (field 9) whose size (field 5) is 1 byte
awk '$5==1 {print $9}' myinput.txt > filetodelete


for f in $(cat filetodelete) ; do rm "$f" ; done

add an extension, since the split command does not add one the way one usually needs

for f in *; do mv $f $f.txt ; done

--------------------------------

Search for duplicated IDs in one or two files
Maybe one needs to:

awk 'seen[$0]++ >=1' mylist_of_IDs.txt
This should print all the names that appear more than once in a file; this works on the full line $0, but could be done field by field
This way I can see, for example, whether one drug is found to hit several targets after merging some output files...
(idea of promiscuous compounds)

awk '{print tolower($0)}' filename
to lower case

awk 'x[$1]++ == 1 { print $1 " is duplicated or present many times"}' myfile
find duplicate names or IDs in field 1

If we have:

File1: mydrug.txt
aspirin
acetylene
amphetamine

File2: molecule_wiki.txt
ammonia N
aspirin O=C(Oc1ccccc1C(=O)O)C
acetylene C#C

awk 'NR==FNR{a[$1]=$2;next} !($1 in a)' molecule_wiki.txt mydrug.txt
print: amphetamine
Meaning: the molecule names present in the second file (mydrug.txt) that are not found in the first (wiki) file

Delete last line of a file on a Mac (this acts on the file itself)
sed -i '' -e '$ d' file1

Print specific field if a score is above a given value
For instance:
awk '{if ($9 > 5.0) print $9, "\t", $11, "\t", $13, "\t", $14, "\t", $15, "\t", $16}' myfile
If field 9 that contains score values is above 5 in energy value, print some other fields separated with a tab

Do not print some lines
grep -v ' -1000.00 '
Will remove lines with space -1000.00 and a space

Print the first 10 lines of a file (emulates "head -10")
awk 'NR < 11'
or if we change default FS
awk 'BEGIN {FS=","} NR < 2001 {print $1}'

Some simple examples to start
PDB file: Count the residues in a PDB file (of course there may be several chains, and this can be checked)
awk 'BEGIN {counter = 0} {if ($3 == "CA") {counter++}} END {print counter}' myfile.pdb

To select lines with Atom in field 1 and Calpha (field 3) and get the amino acid name (present in field 4)
awk '$1 == "ATOM" && $3 == "CA" {print $4}' mypdb.pdb

or

awk '$1!="HETATM"' myfile.pdb | grep CA | awk '{print $4}' > myoutput.pdb
(to match exactly the pattern HETATM)
or
cat myPDBfile.pdb | awk '$1!="HETATM"' | grep CA | awk '{print $4}'
(Note: the != negates the matching pattern)


Get the sequence from a PDB file (warning: does not check for non-standard amino acids or for multiple chains)
cat mypdb.pdb | awk '$1 == "ATOM" && $3 == "CA" {print $4}' | awk ' { gsub( /VAL/, "V"); gsub( /GLY/, "G"); gsub( /ALA/, "A"); gsub( /LEU/, "L"); gsub( /ILE/, "I"); gsub( /SER/, "S"); gsub( /THR/, "T"); gsub( /ASP/, "D"); gsub( /ASN/, "N"); gsub( /LYS/, "K"); gsub( /GLU/, "E"); gsub( /GLN/, "Q"); gsub( /ARG/, "R"); gsub( /HIS/, "H"); gsub( /PHE/, "F"); gsub( /CYS/, "C"); gsub( /TRP/, "W"); gsub( /TYR/, "Y"); gsub( /MET/, "M"); gsub( /PRO/, "P"); residues = residues $1} END {print residues }'
or
awk '$1 == "ATOM" && $3 == "CA" {print $4}' MYPDB_file.pdb | awk ' { gsub( /VAL/, "V"); gsub( /GLY/, "G"); gsub( /ALA/, "A"); gsub( /LEU/, "L"); gsub( /ILE/, "I"); gsub( /SER/, "S"); gsub( /THR/, "T"); gsub( /ASP/, "D"); gsub( /ASN/, "N"); gsub( /LYS/, "K"); gsub( /GLU/, "E"); gsub( /GLN/, "Q"); gsub( /ARG/, "R"); gsub( /HIS/, "H"); gsub( /PHE/, "F"); gsub( /CYS/, "C"); gsub( /TRP/, "W"); gsub( /TYR/, "Y"); gsub( /MET/, "M"); gsub( /PRO/, "P"); residues = residues $1} END {print residues }' > myoutput.seq

Distance between 2 atoms in a file - simple test1
if I have only the x, y, z coordinates of 2 atoms in the file: coordinates

x y z
------------------
50.211 14.979 24.196 (note: this is my atom x1)
50.142 15.162 25.415 (note: this is my atom x2)

The distance between these 2 atoms is:
sqrt((x1-x2)^2 + (y1-y2)^2 + (z1-z2)^2)

awk '{printf "%s\t", $0}' coordinates > my_2_atoms_in_a_row.txt
to make 1 row of my columns

I get this in my file my_2_atoms_in_a_row.txt:

$1 $2 $3 $4 $5 $6
x1 y1 z1 x2 y2 z2
-----------------------------------------
50.211 14.979 24.196 50.142 15.162 25.415

awk '{ a=sqrt(($1-$4)^2 + ($2-$5)^2 + ($3-$6)^2); print a}' my_2_atoms_in_a_row.txt
result is 1.23459 (here in angstroms)

Distance between 2 atoms in a file - test2
Now, if i have in my file coordinates:
50.211 14.979 24.196 (my atom x1)
50.142 15.162 25.415 (my atom x2)

Then
awk 'NR ==1 {split($0,a," ")}; NR == 2 {split($0,b," ")} END {print sqrt((a[1]-b[1])^2 + (a[2]-b[2])^2 + (a[3]-b[3])^2)}' coordinates
Result = 1.23459
or:
awk 'NR ==1 {split($0,a," ")}; NR == 2 {split($0,b," ")} END {printf "%.3f\n", sqrt((a[1]-b[1])^2 + (a[2]-b[2])^2 + (a[3]-b[3])^2)}' coordinates
Result = 1.235

Some more about PDB files (can be other file types as well...)

Count residues of a given type, eg Ala, in a PDB file
awk '{if($3=="CA" && $4=="ALA"){num++}}END{print num, "ala"}' input_pdbfile.pdb

Net charge estimate of a PDB file (Glu/Asp counted as -1, Arg/Lys as +1)
awk '{if($3=="CA" && ($4=="GLU" || $4=="ASP")){charge--};if($3=="CA" && \
($4=="ARG" || $4=="LYS")){charge++}}END{print "Total charge is", charge}' input_pdbfile.pdb

Approximate center of mass (in fact the unweighted geometric center) of a PDB file
awk '{x+=$7; y+=$8; z+=$9; atom++}END{print "The center is", x/atom, y/atom, z/atom}' input_pdbfile.pdb

If we want the approximate center computed only on CA atoms
awk '{if($3=="CA"){x+=$7; y+=$8; z+=$9; atom++}}END{print "Center is", x/atom, y/atom, z/atom}' input_pdbfile.pdb


A simple example to combine data from two files
Using getline and AWK, with a file containing the coordinates of 4 molecules and another file containing the compound IDs in the same order as these 4 molecules
I want to add the compound IDs to the coordinate file

if in a file called molecule.txt i have:
ID 44000
jkjk_coordinates...
jkk_coordinates...
END

ID 400009
jkkk_coordinates...
mvnvbc_coordinates...
END

ID 58939
jkd_coordinates...
jkjd_coordinates...
END

ID 400009
jj_coordinates...
KKL_coordinates...
END

Thus four molecules, ending with END and with an empty line at the end of the file (important to have the empty line here)

and a file called molecule_ID_numbers.txt with 4 lines and 4 ID numbers in the same order as the molecular coordinates in the file molecule.txt:
7888
9000
10000
15000

I can use getline var and AWK
This form of the getline function takes its input from the file molecule_ID_numbers.txt and puts it in the variable var


The following script reads one record from molecule_ID_numbers.txt each time it encounters a blank line in molecule.txt (ID is an uninitialized awk variable, ie the empty string, so $1 == ID matches the empty lines between molecules). It then prints an empty line, <CAS NUMBER>, and the ID value taken from molecule_ID_numbers.txt

awk '{ if ($1 == ID) {getline var < "molecule_ID_numbers.txt" ; print "\n", "\<CAS NUMBER\>" "\n", var, "\n\n"} else print}' < "molecule.txt" > myoutfile


The output is :
ID 44000
jkjk
jkk
END

<CAS NUMBER>
7888


ID 400009
jkkk
mvnvbc
END

<CAS NUMBER>
9000


ID 58939
jkd
jkjd
END

<CAS NUMBER>
10000


ID 400009
jj
END

<CAS NUMBER>
15000

--------------

Merge - Join two files after a matching pattern

One may need to convert IDs and names between different databases, for example with UniChem (https://www.ebi.ac.uk/unichem) or DAVID (https://david.ncifcrf.gov/)

We have 2 files

protein-name-gene.txt
ID Protein_name
2162 coagulation factor XIII
27165 glutamine gamma
3067 histidine decarboxylase

gene_drug_interaction.txt
drugID ID
DB111 3067
DB147 27165
DB18 2162
DB40 2162
DB125 2162


These files can be joined by specifying the fields that should be used to join the files.
Common to both files is the Entrez Gene ID, here named ID; in gene_drug_interaction.txt this is the second field (join -1 2) and it will be our first file
In the protein-name-gene.txt file this is the first field (join -2 1) and it will be our second file

To do this, we first need to sort the files (avoid multiple empty lines at the end; better to remove them if present)

sort on second field
sort -n -s -k2,2 gene_drug_interaction.txt > gene_drug_interaction_sorted.txt

on the first field
sort -n -s -k1,1 protein-name-gene.txt > protein_name_gene_sorted.txt

join -1 2 -2 1 gene_drug_interaction_sorted.txt protein_name_gene_sorted.txt
output:
ID drugID Protein_name
2162 DB18 coagulation factor XIII
2162 DB40 coagulation factor XIII
2162 DB125 coagulation factor XIII
3067 DB111 histidine decarboxylase
27165 DB147 glutamine gamma

Grep with variable, loop through a file

File 1, I have my compound ID: more name.txt
DB12
DB13
DB14
DB11
DB1234
DB14890

File 2, I have my compound ID and the SMILES for the compounds, not real smiles, here it is just to illustrate: more smiles.txt
DB12 ccccccc
DB15 OOOOOOO
DB14 nnnnnnn

cat name.txt | while read line ; do grep $line smiles.txt ; done
output:
DB12 ccccccc
DB14 nnnnnnn

-----------

Merge two files after a matching pattern - other ideas
If have two files:
file1_smiles_score.txt
Smiles name score
CCC drug1 12
CCCCC drug2 10
CCCCCCC drug3 8

file2_info_drug.txt
disease name vendor cost
type1 drug1 vendor1 high
type2 drug2 vendor2 low
type3 drug3 vendor3 low
type4 drug4 vendor4 unknown

If i run:
awk '{print FILENAME, NR, FNR, $0}' file2_info_drug.txt file1_smiles_score.txt

file2_info_drug.txt 1 1 disease name vendor cost
file2_info_drug.txt 2 2 type1 drug1 vendor1 high
file2_info_drug.txt 3 3 type2 drug2 vendor2 low
file2_info_drug.txt 4 4 type3 drug3 vendor3 low
file2_info_drug.txt 5 5 type4 drug4 vendor4 unknown
file2_info_drug.txt 6 6
file1_smiles_score.txt 7 1 Smiles name score
file1_smiles_score.txt 8 2 CCC drug1 12
file1_smiles_score.txt 9 3 CCCCC drug2 10
file1_smiles_score.txt 10 4 CCCCCCC drug3 8
file1_smiles_score.txt 11 5

NR : gives the total number of records processed
FNR : gives the total number of records for each input file

Syntax:
In awk assigning array elements:
arrayname[string]=value

arrayname is the name of the array
string is the index of an array
value is the value you are assigning to that element of the array

and the "for loop" is something like:
for (var in myarrayname)
actions

Examples:
echo 1 2 3 4 | awk '{my_arrayname[$1] = $3};END {for(i in my_arrayname) print my_arrayname[i]}'
3

echo 1 2 3 4 | awk '{my_arrayname[$1] = $4};END {for(i in my_arrayname) print my_arrayname[i]}'
4

"i" is the index
A "for loop" is needed to iterate and print content of an array

Now if I need to merge file1 and file2 by the drug-names present in field 2 of each file
And I need to add the 3rd column of file 2 to file 1
Something like this:
Smiles name score vendor
CCC drug1 12 vendor1


awk 'NR==FNR {myarray[$2] = $3; next} {print $1,$2,$3,myarray[$2]}' file2_info_drug.txt file1_smiles_score.txt
Smiles name score vendor
CCC drug1 12 vendor1
CCCCC drug2 10 vendor2
CCCCCCC drug3 8 vendor3


Notes found in awk manuals and forums:
awk 'NR==FNR {myarray[$2] = $3; next} {print $1,$2,$3,myarray[$2]}' file2.txt file1.txt

First read file2. NR==FNR is only true for the first file (argument): FNR is the record number (typically the line number) in the current file, while NR is the total record number and keeps increasing across files
"next" skips the remaining commands for that record, so they only run for the files after the first one

The command saves column 3 of file2 in hash-array using column 2 (the names of the drugs) as key: myarray[$2] = $3
Then read file1 and output fields $1,$2,$3, appending the corresponding saved column from hash-array myarray[$2]
This solution should work even if the data are not in the same order, no need to sort
The: myarray[$2] = $3 saves $3 as the value and $2 as the key
It matches exactly the second column from both files


If I need to take the entire lines of file2 and merge them still by the name of the drugs present in field2
awk 'NR==FNR {myarray[$2] = $0; next} {print $1,$2,$3,myarray[$2]}' file2_info_drug.txt file1_smiles_score.txt
Smiles name score disease name vendor cost
CCC drug1 12 type1 drug1 vendor1 high
CCCCC drug2 10 type2 drug2 vendor2 low
CCCCCCC drug3 8 type3 drug3 vendor3 low

If i need to add fields 3 and 4 of file2 to file1
awk 'NR==FNR {myarray[$2] = $3; add_info[$2] = $4; next} {print $1,$2,$3,myarray[$2],add_info[$2]}' file2_info_drug.txt file1_smiles_score.txt
Smiles name score vendor cost
CCC drug1 12 vendor1 high
CCCCC drug2 10 vendor2 low
CCCCCCC drug3 8 vendor3 low

--------------

Some Unix and Awk, dealing with files

Combine two files

paste -d, file1 file2
the "," is now the separator, default paste separator is tab

Read lines in both the files alternatively
paste -d'\n' file1 file2

Create a file with Nano on mac for instance:
nano filetest.txt

then
seq 3 | xargs -I{} cp filetest.txt filetest{}.txt
ls -lrt

filetest.txt
filetest3.txt
filetest2.txt
filetest1.txt

awk '{print $1}' filetest{1..3}.txt
1
4
1
4
1
4

awk '{print $1}' filetest{1..3}.txt | column -t
1
4
1
4
1
4

awk '{print $1, $2}' filetest{1..3}.txt | column -t
1 2
4 4
1 2
4 4
1 2
4 4


With 2 files, you can find this example on the www:
awk 'FNR==NR{a[FNR]=$1; next}{print a[FNR],$1}' myfile1.txt myfile2.txt > output.txt

The AWK variable FNR is the line number within the current input file and NR is the overall line number of the input. The two are equal only while the first input file is being read.
The first fields of the first file are saved in the array a (a[FNR]=$1), whose keys are line numbers and whose values are the 1st fields. Then, when the second file is read, the value corresponding to its line number (a[FNR]) is printed together with the current line's 1st field.
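As a small sketch (hypothetical contents): if myfile1.txt contains
A B
C D
and myfile2.txt contains
1
2
then the command above writes to output.txt:
A 1
C 2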



Assuming each of your files has the same number of rows:

awk -f script.awk file1.txt file2.txt file3.txt file4.txt

Contents of script.awk:

FILENAME == ARGV[1] { one[FNR]=$1 }
FILENAME == ARGV[2] { two[FNR]=$3 }
FILENAME == ARGV[3] { three[FNR]=$7 }
FILENAME == ARGV[4] { four[FNR]=$1 }

END {
for (i=1; i<=length(one); i++) {
print one[i], two[i], three[i], four[i]
}
}

NB:
By default, awk separates columns on whitespace. This includes tab characters and spaces, and any amount of these.
This makes awk ideal for files with inconsistent spacing.
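A quick check of this behaviour:

echo "a   b     c" | awk '{print $2}'
b

Any run of spaces and/or tabs counts as a single separator, and leading blanks are ignored.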


warning: there must be no space between < and ( in the process substitution <(...)
paste <(awk '{print $1}' filetest3.txt) <(awk '{print $3}' filetest.txt)
paste <(awk '{print $1}' filetest3.txt) <(awk '{print $0}' filetest.txt) > file4


-----

Print lines after string-match with AWK
input myfile is:
@<TRIPOS>MOLECULE
1
ghh
tripos
TRIPOS

@<TRIPOS>MOLECULE
2
kkl

kkl
TRIPOS

@<TRIPOS>MOLECULE
3
llll
toto

---
The getline command reads the next line from the file


awk '/@<TRIPOS>MOLECULE/ {getline; print $0}' myfile
or
awk '/@<TRIPOS>MOLECULE/ {getline; print;}' myfile
or
awk '/@<TRIPOS>MOLECULE/ {getline; print}' myfile
print this:
1
2
3

awk '/@<TRIPOS>MOLECULE/ {print;getline;print;}' myfile
print this:
@<TRIPOS>MOLECULE
1
@<TRIPOS>MOLECULE
2
@<TRIPOS>MOLECULE
3

Print 2 lines after a string
awk '/@<TRIPOS>MOLECULE/{x=NR+2;next}(NR<=x){print}' myfile

Print 1 line after the first string is found and stop
awk '/@<TRIPOS>MOLECULE/ {getline; print;exit}' myfile

-----
If in a Mol2 file we have:

@<TRIPOS>MOLECULE
compoundID_000

I need to fetch all the compound IDs below @<TRIPOS>MOLECULE but I do not want the numbers after the underscore
I can try with:
awk 'BEGIN {FS="_" } /@<TRIPOS>MOLECULE/ {getline; print $1}' myfile


fetch the lines after @<TRIPOS>MOLECULE or after Energy
awk '/@<TRIPOS>MOLECULE|Computed Energy/ {getline; print $1}' small_database_no_underscore.mol2


print two lines into one
input file is:
9999899989
9
ZINC00006468999
3
ZINC00006468
7
9999899985
5

the large numbers are compound IDs, and the number below is the energy score

This will print the energy first and the compound IDs
awk 'BEGIN{i=1}{line[i++]=$0}END{j=1; while (j<i) {print line[j+1] , line[j]; j+=2}}' myfile

I can combine the two:

awk '/@<TRIPOS>MOLECULE|Computed Energy/ {getline; print $1}' small_database_no_underscore.mol2 | awk 'BEGIN{i=1}{line[i++]=$0}END{j=1; while (j<i) {print line[j+1] , line[j]; j+=2}}' > outputfile1

to print this:
9 9999899989
3 ZINC00006468999
7 ZINC00006468
5 9999899985

Another way to print two lines into one:

cat myfile | paste -d, - - > ouput
Cat is not needed in fact, but the command reads more clearly this way; the equivalent is: < myfile paste -d, - - > output

paste can read several inputs; instead of a file name, we can use - (dash) for stdin. With two dashes, paste takes the first line of stdin for the first column and then needs a line for the second column.
Since the first line of stdin has already been read and processed, the next input read is the second line. This glues the second line to the first.
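A quick illustration (any input with an even number of lines behaves the same way):

printf 'a\nb\nc\nd\n' | paste -d, - -
a,b
c,d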


If a good energy is 9 and a bad one is 3, I can sort based on the energy:
sort -n -r -k1,1 outputfile1
-n is numeric sort, -r is reverse order, -k1,1 is the first field
i get:
9 9999899989
7 ZINC00006468
5 9999899985
3 ZINC00006468

Then i can fetch only the compound IDs after sorting

sort -n -r -k1,1 outputfile1 | awk '{print $2}' > ID_numbers_sorted.txt

Combine all to try:

awk '/@<TRIPOS>MOLECULE|Computed Energy/ {getline; print $1}' small_database_no_underscore.mol2 | awk 'BEGIN{i=1}{line[i++]=$0}END{j=1; while (j<i) {print line[j+1] , line[j]; j+=2}}' | sort -n -r -k1,1 | awk '{print $2}' > ID_numbers_sorted.txt

or with the underscore in the database for the docked compound IDs (here it can be different poses of the docked compounds, _01, _02...) and with cat:
cat small_database_with_underscores_on_IDs.mol2 | awk 'BEGIN {FS="_" } /@<TRIPOS>MOLECULE|Computed Energy/ {getline; print $1}' | awk 'BEGIN{i=1}{line[i++]=$0}END{j=1; while (j<i) {print line[j+1] , line[j]; j+=2}}' | sort -n -r -k1,1 | awk '{print $2}' > ID_numbers_sorted.txt

If i have in the mol2 file:

@<TRIPOS>MOLECULE
93894861_008
52 54 0 0 0
SMALL
NO_CHARGES


@<TRIPOS>ATOM
1 Cl 31.1781 13.6798 13.8999 Cl
51 22 49 1
52 22 50 1
53 29 51 1
54 29 52 1
@<TRIPOS>PROPERTY_DATA
Total_Score | 4.9061

This means the compound IDs carry an underscore, which I want to remove, and the energy score is not written below a line containing a specific string but on the same line as the string

thus i have to change the above with:

cat small_database_with_underscores_on_IDs.mol2 | awk 'BEGIN {FS="_" } /@<TRIPOS>MOLECULE/ {getline; print $1} /Total/ {print $NF}' | awk 'BEGIN{i=1}{line[i++]=$0}END{j=1; while (j<i) {print line[j+1] , line[j]; j+=2}}' | sort -n -r -k3,3 | awk '{print $4}' > ID_numbers_sorted.txt

One intermediate view is like this:

Score | 2.2927 98268304
Score | 3.9573 99806113
Score | 3.7241 99884208
Score | 4.7876 99970627

Thus i need to sort on the 3rd column only (-k3,3) and then the compound IDs are in field 4 (thus the print $4)

If underscores are a problem in the Mol2 file, they can be changed, for instance with

cat mymol2file.mol2 | awk '{gsub(/\<NO_CHARGES\>/,"NO-CHARGES")}1' | awk '{gsub(/\<Total_Score\>/,"Total-Score")}1' > myoutput.mol2

gsub is global substitution and the \< \> to mark the edge of the string to substitute to avoid changing other words

another option is: awk '{gsub(/pattern/,"replacement",$N)}1' inputfile to act only on a specific field $N

like: awk '{gsub(/1/,"0",$1)}1' file

exact match also with
awk '/^toto$/' myfile
the ^ and $ for start and end of the line

To check the number of lines in the compound ID file, possible to use:

awk 'NF{c++}END{print "total: "c}' myfile

instead of wc -l (which may miss the last line depending on how the file ends)


see below for this one:
./extract_mol2_awk_script ID_numbers_sorted.txt small_database_no_underscore.mol2 MYCOMPOUND_ENERGY_SORTED.mol2

-----------

Extract molecules from Mol2 file if it matches a list of compound IDs present in another file (see below for SDF file)

For example in one file you have the compound ID like this:
5247489
5523453
6072484
6190013

Also, it is important to remember that with Awk one can define the Record Separator (RS) and the Field Separator (FS); this makes MOL2 files easier to handle. In that case $0 (usually the whole current line) becomes the complete description of a single molecule, excluding the starting tag @<TRIPOS>MOLECULE, and with FS="\n" each field is one line of the record; the molecule name usually sits on the line right after the tag (field $2 in the extraction script below, since each record starts with a newline).
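For instance, a minimal sketch (assuming the molecule name is on the line right after each @<TRIPOS>MOLECULE tag; a multi-character RS needs gawk or a compatible awk, as in the extraction script below) to list all the molecule names of a bank:

awk 'BEGIN {RS="@<TRIPOS>MOLECULE" ; FS="\n"} NR>1 {print $2}' large_Mol2_bank.mol2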

It is possible to pass a Shell variable to awk in different ways... for example in a file you can have:
var=5
var2=7

awk -v gg=$var -v iii=$var2 'BEGIN {print gg"\t" iii}'


then, inside a file called for instance extract_mol2_awk_script, you can have this small script:
---
if [ $# != 3 ]; then echo "Usage: ./shell_file list_of_compound_IDs_one_per_line_no_space large_Mol2_bank.mol2 output_my_extracted_molecules.mol2" ; exit
fi

mylist=$1
largeMol2bank=$2
extractedMol=$3

input_list_id=`cat $mylist`

for i in $input_list_id
do
echo $i

echo "molecule mol2" $i "extracted"
awk -v myvar=$i 'BEGIN {RS="@<TRIPOS>MOLECULE" ; FS="\n"} $2==myvar {print "@<TRIPOS>MOLECULE" ; print $0 }' $largeMol2bank | awk 'NF' >> $extractedMol
done

echo "The extracted molecules are in the file:" $extractedMol

# Comments
# awk -v myvar=$i 'BEGIN {RS="@<TRIPOS>MOLECULE" ; FS="\n"} $2==myvar {print "@<TRIPOS>MOLECULE" ; print $0 }' $largeMol2bank | awk 'NF' >> $extractedMol
# With awk, NF is only non-zero on non-blank lines. When this condition is true, awk's default action is to print the whole line; this removes the empty lines
# It seems that without the blank lines, the file is ok, but to double check
#
# if the number ID in the large mol2 file has underscores, like this is a special pose number
# and this is not present in your list of compound IDs (that are usually the ID to order compounds)
# one can use: cut -f1 -d '_' small_database.mol2 > small_database_no_underscore.mol2
# the -f1 takes what is before the delimiter, defined here as "_", thus via -d_ or -d '_'
# Outputs the first field (-f1) of the file where fields are delimited by underscore (-d '_')
# check there are no other places with underscore that could be damaged by this command
#
# to use this script written into a file (extract_mol2_awk_script) made executable by chmod + x
# do: ./extract_mol2_awk_script my_IDs small_database_no_underscore.mol2 MYOUTPUT.mol2
# the IDs can be 990900 (often compound IDs are numbers but can be something else
# this works for instance if the compound ID is below @<TRIPOS>MOLECULE
# end

Extract molecules with Babel

for instance, extract the first 3 molecules of a large collection sdf file

babel -isdf input.sdf -f 1 -l 3 -osdf output.sdf

--------------

Extract molecules from SDF file with awk - a simple script

We have a file with the name of the drugs we want to extract, one name per line (file name.txt):
name2
name3

The large SDF file with the molecules - mysdf.txt:
name1
klkl
ll
pppp
$$$$

name2
oppoo
pppp
$$$$

name3
pool
plplp
$$$$

A script could be something like this in a file myscript_extract_sdf.txt, chmod +x and run with for instance ./myscript_extract_sdf.txt

echo -n "select the file with the ID of molecules, 1 ID per line no space: "
read -e file1

echo -n "select the large SDF file, each compound should start by ID and end with 4 dollars : "
read -e file2

echo -n "enter the name of the output file for your selected molecules ending with .sdf if you want : "
read -e myoutputsdf

myinputdrugnamefile=`cat $file1`
for i in $myinputdrugnamefile
do

echo "molecule" $i "extracted"
awk -v myvar=$i '$0==myvar , $0=="$$$$"' $file2 | sed '/\$\$\$\$/G' >> $myoutputsdf
done
echo "The extracted molecules are in the file:" $myoutputsdf


# This should take care of similar molecule names, like name2 and name20 should be different, but double check
# default RS is newline, default FS is whitespace
# if i need to add empty line after the 4 dollars sed '/\$\$\$\$/G'
# empty line before pattern : sed -e '/\$\$\$\$/i\\' mysdf.txt
# or sed '/\$\$\$\$/s/^/\n/' mysdf.txt
# END

If one needs to print only one molecule, for instance name1:
awk '/name1/,/\$\$\$\$/{print}' mysdf.txt
awk '/^name2$/,/\$\$\$\$/{print}' mysdf.txt
the ^ and $ to force exact match of the name, but double check if ok on your system

to print molecules 1 and 2:
awk '/name1/ || /name2/,/\$\$\$\$/{print}' mysdf.txt

--------------

Swap (shift) ID names in SDF file before the coordinates

This can really be a pain as some packages are very very sensitive to non standard SDF format
For instance, if the file starts with a blank line the program may crash (awk 'NR>1' filename prints everything after the first line and can be used to drop it); if there is a blank/empty line after $$$$, it may also crash, etc

If each compound starts with a string, for instance ISIS, but there are blank lines after the $$$$ etc, one can try some cleaning before using the name-swap script:

cat my3D.sdf | awk '/ISIS/{print "ZZZZZ"}1' | awk -v n=-2 'NR==n+1 && !NF{next} /\$\$\$\$/ {n=NR}1' | awk 'NR>1' > myOUTPUT.sdf

This will put ZZZZZ before ISIS, then delete the empty line after $$$$ and remove the first blank line of the file

then Swap script:

We can have this short script in a file, do the chmod + x and can run with ./myscript

# We have here all compounds that start with something like ISIS blabla
# After the coordinates, assuming we have a flag: > <Name> and one line after the name or ID of the compound
#
#
echo -n "select the SDF file where you want to move or swap the cmp names before the coordinates:"
read -e file1
grep "\<Name\>" -A1 $file1 -c
# to count the number of IDs in the SDF file
grep "\<Name\>" -A1 $file1 | grep -v "<Name>" | grep -v "\-\-" | sed -e 's/ /-/g' > only_my_cmpd_name_from_the_SDFbank.txt
# this print the line with > <Name> and the compound ID below and two -- between some compound names
# the sed action is to replace white space in a name by - thus to something like compound-name-full
sleep 1
# sleep is not needed
echo step_1_grep_all_the_drug_names_done
grep "\<ZZZZZ\>" $file1 -c
sleep 1
echo step_2_grep_all_the_ISIS_name_on_top_of_each_cmpd_this_number_should_be_the_same_as_the_first_number_printed_above
awk '{
if ($1 == "ZZZZZ") {
getline < "only_my_cmpd_name_from_the_SDFbank.txt"
print
} else
print
}' < $file1 > "${file1%.*}"_swapped_name.sdf
sleep 1
rm only_my_cmpd_name_from_the_SDFbank.txt
echo "this is done"
# Some notes
# grep with the \< \> word boundaries is there to match the string Name exactly
# Here we need to move the name of the drug compound in a SDF file located after the field Name
# on top of the compound to replace the word ISIS. When the 3D
# structure is generated with some packages, this is needed as input for docking tools, otherwise the name or ID could be lost
# First read the sdf file in a variable file1
# Then find the names and put them into a file
# then with awk, when i find ISIS in the $file1, thus the SDF file, with Awk, i getline in the file with only the name of the drug
# and i replace ISIS by the name of the drug.
# Then we write all in the root name of the file; for instance if one has myfile.sdf as input, the script takes what is before the dot and adds the extension _swapped_name.sdf
# the read -e allows auto complete with tab when reading the file name
# in awk == should be exact matching
#
# in the SDF file, after the names have been moved, one may want to remove the compound ID after > <Name> to avoid having it twice, which can be a problem
# for some project
# one can try, seems to run on mac as well:
# sed -e '/\<Name\>/ { N; d; }' mysdffile.sdf
# the {} gets executed if pattern found, the N is for next line and d is to delete the current line with Name and the line after with the ID
# the "\<mystring\>" as in 's/\<mystring\>/toto/g' is for exact match and g for global
# with awk something like this:
# awk '{sub(/\<mystring\>/, "toto")}1' inputfile > output
#

Another quick solution (nothing optimized here) for collections from Molport or related vendors

#
# After the coordinates, assuming we have a flag: <PUBCHEM_EXT_DATASOURCE_REGID> and one line after the compound ID
#
# usage only run the script with nothing else: ./this_script
#
# We assume that each compound starts with Mrv thus above the coordinates, something like: Mrv17110 10291723052D
# to shift the compound ID name above the coordinates: first replace each line containing Mrv (the whole line, up to its end) with ZZZZZ:
# awk '{sub(/^.*Mrv.*$/,"ZZZZZ")}1' iissc-006-500-000--006-999-999.sdf > myinput_modified.sdf
#
#
echo -n "select the SDF file where you want to move or swap the cmp names before the coordinates:"
read -e file1

grep "\<PUBCHEM_EXT_DATASOURCE_REGID\>" -A1 $file1 -c
# to count the number of compound IDs present after this REGID line in the SDF file

grep "\<PUBCHEM_EXT_DATASOURCE_REGID\>" -A1 $file1 | grep -v "\<PUBCHEM_EXT_DATASOURCE_REGID\>" | grep -v "\-\-" > only_my_cmpd_name_from_the_SDFbank.txt
# this print the line with <PUBCHEM_EXT_DATASOURCE_REGID\> and the compound ID that is written below that line and also two -- between some compound names
# and the grep -v will remove this
sleep 1

# sleep is not needed
echo step_1_grep_all_the_drug_names_done

awk '{sub(/^.*Mrv.*$/,"ZZZZZ")}1' $file1 > mymodified_molport_tmp_file.sdf

cat mymodified_molport_tmp_file.sdf | grep "\<ZZZZZ\>" -c


sleep 1
echo step_2_grep_all_the_Mrv_name_on_top_of_each_cmpd_this_number_should_be_the_same_as_the_first_number_printed_above

cat mymodified_molport_tmp_file.sdf | awk '{
if ($1 == "ZZZZZ") {
getline < "only_my_cmpd_name_from_the_SDFbank.txt"
print
} else
print
}' > mydatabase_swapped_name.sdf
sleep 1
rm only_my_cmpd_name_from_the_SDFbank.txt
rm mymodified_molport_tmp_file.sdf
echo "this is done"
#
#

--------------

Extract the top X molecules of a large mol2

This script has to be a in file, then chmod + x and run with, depending the system, with: ./myscript

#

#

if [ $# != 2 ]; then echo "Usage: ./this_script large_input_bank.mol2 number_of_molecules_to_extract" ; exit
fi

myinputbank=$1

howmany_cmpd_doyou_want=$2

fixcountloop=$(($howmany_cmpd_doyou_want + 1))


echo "the top " $howmany_cmpd_doyou_want " molecules will be written in Top_extracted_molecules.mol2"

awk 'BEGIN { RS="@<TRIPOS>MOLECULE" ; FS = "\n" }
NR > 1 && NR <= howmany { print "@<TRIPOS>MOLECULE" $0 > "Top_extracted_molecules.mol2" }' howmany=$fixcountloop $myinputbank


# usage ./this_script largeINPUTbank.mol2 number_of_mol_i_want_to_extract

#

#

--------------

Extract molecules from SDF files

SD files exported from some packages do not follow the SDF format. You may need to add something, like some flags... for instance a line with 'M END' before every terminating '$$$$'. This can be done as below; the output field separator (OFS) is set to the empty string so that nothing is inserted between the record and the added text (try OFS="\n" to see extra blank lines appear; the normal default OFS is a single space).

awk 'BEGIN {RS="\\$\\$\\$\\$\n"; OFS=""} {print $0,"M END\n$$$$"}' badSDFformat.sdf > Clean.sdf

SDF file should be
Compound_IDs
coordinates
$$$$
Compound_IDs

(no blank line between $$$$ and the next Compound_ID)


To extract one molecule from one SD file, we assume the ID number starts each compound in field 1, the end of the compound is always $$$$
One can try :

awk 'BEGIN {RS="\\$\\$\\$\\$\n"; FS="\n"} $1=="patulin" {print $0; print "$$$$"}' largefile.sdf > one_molecules.sdf

or if it adds the $$$$ twice
awk 'BEGIN {RS="\\$\\$\\$\\$\n"; FS="\n"} $1=="patulin" {print $0}' largefile.sdf > one_molecules.sdf

Extract a list of molecules, if you have the IDs of the compounds in a file and the large SDF file in another file
In one file you have the ID like this:
9154647
9155875...


With this script in a file and chmod +x:
#
if [ $# != 3 ]; then echo "Usage: ./shell_file list_of_IDs_one_per_line_no_space large_SDF_bank.sdf file_name_of_extracted_molecules.sdf" ; exit
fi

mylist=$1
largeSDFbank=$2
extractedMol=$3

input_list_id=`cat $mylist`


for i in $input_list_id
do
echo $i


echo "molecule" $i "extracted"
awk -v myvar=$i 'BEGIN {RS="\\$\\$\\$\\$\n" ; FS="\n"} $1==myvar {print $0}' $largeSDFbank >> $extractedMol
done


echo "The extracted molecules are in the file:" $extractedMol

# Usage: ./this_shell_file list_of_IDs_one_per_line_no_space large_SDF_bank.sdf file_name_of_extracted_molecules.sdf "
# NOTE WARNING: MAYBE YOU DO NOT NEED TO REMOVE BLANK LINES WITH THE FIRST SED COMMAND NOR ADD AN EMPTY LINE BEFORE THE DOLLARS
# IF SO DELETE THE SLEEP AND SED....LINES AND END WITH echo "The extracted molecules are in the file:" $extractedMol
# if $$$$ missing at the end use:
# awk -v myvar=$i 'BEGIN {RS="\\$\\$\\$\\$\n" ; FS="\n"} $1==myvar {print $0 ; print "$$$$"}' $largeSDFbank >> $extractedMol
#
#

Repeat n time a string

Repeat UK 3 times, all on the same line, on Mac, the \n does not seem ok

awk 'BEGIN{c=0; do{printf "UK "; c++}while(c<3); printf "\n" }'

To have it on different lines

awk 'BEGIN{c=0; do{printf "UK "; c++}while(c<3); printf "\n" }' | awk -v OFS='\n' '{$1=$1}1'

Some info for Y-randomization in a column

File with the dependent and independent variables:

Y
CLASS MW Sv Se Sp
0
0
1

Remove blank empty line if any
awk 'NF > 0' filename > no-blank-filename

Then wc -l gives 1885 lines including the header
I thus need 1884 random 0 or 1 values to replace my original true Y CLASS

gshuf -i 0-1 -r -n 1884 -o myrandom_class.txt
Generates at random a column of values in the range 0 to 1 (-i 0-1), repeated (-r) 1884 times (-n 1884)

Then add header
awk 'BEGIN{printf("CLASS\n");} {print;}' myrandom_class.txt > myrandom_class_header.txt

Merge the two files
I have: myrandom_class_header.txt
and: my-original-data-original_with-true-CLASS-column.txt
paste myrandom_class_header.txt my-original-data-original_with-true-CLASS-column.txt > original-data-y-random-via-paste.txt

Then in python Pandas i can just check and delete the original CLASS column : del df['CLASS']

or with unix
remove field 1 (but here format is lost)
awk '{$1=""; print }' filename

Remove first field with awk but separate with single space the output file
awk '{$1=""}1' original_data_noblank.txt | awk '{$1=$1}1' > output

Remove first field with awk but with tab separated output column (should work also on mac)
awk '{$1=""}1' original_data_noblank.txt | awk -vOFS="\t" '{$1=$1}1' > output

Other tips:
Count the number of fields
awk '{print NF; exit}' original_data.txt

Replace field 1 in no-blank-filename with the field 1 of a file myrandom_class_header.txt:
awk 'FNR==NR{a[NR]=$1;next}{$1=a[FNR]}1' myrandom_class_header.txt no-blank-filename > modified-file.txt

If there is a need of tab instead of one space to separate the column (but warning about the original formatting that could be lost):
awk '{$1=$1}1' OFS="\t" myfile.txt > myfile_TAB.txt

Get some random compounds (lines) in a SMILES file
On mac, this requires to do: brew install coreutils

If a user wants approximately 1% of the non-blank lines
cat input.txt | awk 'BEGIN {srand()} !/^$/ { if (rand() <= .01) print $0}' > sample.txt

In Perl, but faster; it does not load the whole input file into memory
perl -ne 'print if (rand() < .01)' huge_file.csv > sample.csv

Another example taken from the internet
Samples 1000 lines from a 1m line file:
perl -ne 'print if (rand() < .0012)' huge_file.csv | head -1000 > sample.csv

With AWK preserving a header row, and when the sampling can be an approximate percentage of the file. Works for very large files:
awk 'BEGIN {srand()} !/^$/ { if (rand() <= .01 || FNR==1) print > "data-sample.txt"}' data.txt

Another way
awk 'BEGIN {srand()} !/^$/ { if (rand() <= .01) print $0}'

The shuf utility will be there as gshuf
then: To get 10 lines at random
gshuf myfile_one_cmpd_per_line.smi -n 10 -o output-file-10-random-lines.txt

To remove these 10 random lines from the original file, a possible way is with

grep -Fvxf output-file-10-random-lines.txt myfile_one_cmpd_per_line.smi > my-original-file_one-cmpd-per-line-minus--the-10-random-lines.txt

-F: use literal strings
-x: only consider matches that match the entire line
-v: print non-matching
-f file: take patterns from the given file

Get some compounds at random (pseudo) from SDF file
We get the SDF file from Pubchem
Each compound starts with ID number
Ends with $$$$ and no empty line between the $$$$ and the following ID number
I can try something like this to get the compound IDs and then select at random X compound IDs from the list
and put the output in a new SDF file
On mac, this requires to do: brew install coreutils
Such as to have gshuf

Copy the script below in a file myscript_random_sdf, do the chmod +x and run with ./myscript_random_sdf

# the read -e allows tab auto-completion for the file names
# echo with some special info to color the text
#echo -n "select the file with the ID of molecules, 1 ID per line no space: "

echo -n -e "\033[1;31mselect the file with the ID of molecules, 1 ID per line no space: \033[0m"
read -e file1


echo -n -e "\033[1;31mselect the large SDF file, each compound should start by ID and end with 4 dollars : \033[0m"
read -e file2

echo -n -e "\033[1;31menter the name of the output file for your selected molecules ending with .sdf : \033[0m"
read myoutputsdf

echo -n -e "\033[1;31mhow many compounds do you want to extract at pseudorandom from the ID file : \033[0m"
read random_selection

jo=`cat $file1 | gshuf -n $random_selection`
for i in $jo
do
echo "molecule" $i "extracted"

awk -v myvar=$i 'BEGIN {RS="\\$\\$\\$\\$\n" ; FS="\n"} $1==myvar {print $0 ; print "$$$$"}' $file2 >> $myoutputsdf
done

echo "The extracted molecules are in the file:" $myoutputsdf

# GNU shuf or gshuf is installed on mac with: brew install coreutils
# tested like this: seq 100 | gshuf -n 3
# sequence of 100 and random select 3 numbers from 100

# The read -e should allow autocomplete of filenames with tab
# To get the compound IDs from the SDF file I can do something simple
#
# awk '/\$\$\$\$/ {getline; print $0;}' mysdffile
# but this can print the last $$$$ if there is not empty line after
# I can add some new empty lines at the end of the file with:
#
# I get the first line of mysdf file with : awk 'NR==1' myfile
# add two new lines at the SDF file only if there are none
# awk '/^$/{f=1}END{ if (!f) {print "\n\n"}}1' mysdf > output
# or print some empty lines in all case
# awk 'END {print "\n\n"}1' mysdf > output
# remove empty lines if needed: awk 'NF' myfile
# If i combine all I can have something like this to get all the compound IDs of the pubchem SDF file:
# cat myPubChemfile.sdf | awk 'END {print "\n\n"}1' | awk 'NR==1 ; /\$\$\$\$/ {getline; print $0;}' | awk 'NF' > myIDs.txt
#

#warning, this may add $$$$ twice if it extracts the last compound of the SDF file, this has to be checked and if so deleted
#could be changed in the script but no time. For some software packages it does not matter for others, it may crash the tool

Find string in common between two files without sorting
File1:
10
20
30
3000

File2:
10
3000

grep -F -f file1 file2
Print:
10
3000

the -F, --fixed-strings
-f FILE, --file=FILE
Can be slow on large files

Warning grep can do some strange things (regular expression versus plain string, blank lines..)

or equivalent with awk (keep the order of file 2)
awk 'NR==FNR{a[$1]++;next} a[$1]' file1 file2


awk 'NR==FNR{arr[$0];next} $0 in arr' file1 file2
This will read all lines from file1 into the array arr[], and then check for each line in file2 if it already exists within the array (i.e. file1). The lines that are found will be printed in the order in which they appear in file2. The comparison in arr uses the entire line from file2 as index to the array, so it will only report exact matches on entire lines.
or
find what is not in common
awk 'NR==FNR {exclude[$0];next} !($0 in exclude)' file1 file2

print:
20
30

Molecules in two files, collect data from the two files. Simple scripts when you have no time

Finding and preparing the data files. For instance, to get only the names and SMILES from the DrugCentral file (warning: there might be duplicates... this should be sorted out before or after running this job...)
awk '{print $1, "\t", $4}' filedrugcentral.tsv

Then file 1: BDDCS file with name of the drug and then the class type, thus from 1 to 4..., here only class 1 is shown
abacavir 1
acarbose 1
acebutolol 1
acetaminophen 1
paracetamol 1
...........

And file 2: Drugs from DrugCentral with the smiles and the name
NC(=O)C1=C(N)N(C=N1)[C@@H]1O[C@H](CO)[C@@H](O)[C@H]1O acadesine
CC(=O)NCCCS(O)(=O)=O acamprosate
C[C@H]1O[C@H](O[C@@H]2[C@@H](CO)O[C@H](O[C@@H]3[C@@H](CO)OC(O)[C@H](O)[C@H]3O)[C@H](O)[C@H]2O)[C@H](O)[C@@H](O)[C@@H]1N[C@H]1C=C(CO)[C@@H](O)[C@H](O)[C@H]1O acarbose
CCCC(=O)NC1=CC(C(C)=O)=C(OCC(O)CNC(C)C)C=C1 acebutolol
CCC(Br)(CC)C(=O)NC(=O)NC(C)=O acecarbromal
OC(=O)COC(=O)CC1=C(NC2=C(Cl)C=CC=C2Cl)C=CC=C1 aceclofenac
NC1=CC=C(C=C1)S(=O)(=O)C1=CC=C(NCC(O)=O)C=C1 acediasulfone

---

awk 'FNR==NR{f1[$1]=$0;next}; $2 in f1 {print $0}' bddcs.txt drug_central_smiles_only.txt > matchedRecords.txt
will print something like this, thus the smiles and name of the drugs of file2 (DrugCentral), if the name was present in the BDDCS file
CCCC(=O)NC1=CC(C(C)=O)=C(OCC(O)CNC(C)C)C=C1 acebutolol
CC(=O)NC1=CC=C(O)C=C1 paracetamol
CC(=O)NC1=NN=C(S1)S(N)(=O)=O acetazolamide
CC(=O)C1=CC=C(C=C1)S(=O)(=O)NC(=O)NC1CCCCC1 acetohexamide

Or

awk 'FNR==NR{f1[$1]=$0;next}; $2 in f1 {print $0, f1[$2]}' bddcs.txt drug_central_smiles_only.txt > matchedRecords.txt
will print something like this with the BDDCS class type added:
CCCC(=O)NC1=CC(C(C)=O)=C(OCC(O)CNC(C)C)C=C1 acebutolol acebutolol 1
CC(=O)NC1=CC=C(O)C=C1 paracetamol paracetamol 1
CC(=O)NC1=NN=C(S1)S(N)(=O)=O acetazolamide acetazolamide 4
CC(=O)C1=CC=C(C=C1)S(=O)(=O)NC(=O)NC1CCCCC1 acetohexamide acetohexamide 1
CC(=O)OC1=CC=CC=C1C(O)=O acetylsalicylic acetylsalicylic 1

If I want to generate a csv file and replace all tabs and white space with commas, for instance for DataWarrior:
sed 's/[[:space:]]/,/g' matchedRecords.txt > matchedRecords_with_coma.csv
If this produces several consecutive commas, they can be cleaned up with a text editor or on the command line
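
For instance, a simple way to squeeze runs of consecutive commas into a single one (just a sketch, assuming no empty fields need to be preserved):
tr -s ',' < matchedRecords_with_coma.csv > matchedRecords_clean.csv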

Find IDs in common or different in two files, some kind of Venn diagram could be done
File1
2000_2
3000_1
8800_00
304_09

File2
8800_00
304_09

unique data in file1:
awk 'BEGIN {FS = "_"} NR==FNR{a[$1];next}!($1 in a)' file2.txt file1.txt

unique data in file2:
awk 'BEGIN {FS = "_"} NR==FNR{a[$1];next}!($1 in a)' file1.txt file2.txt

common ID numbers in two files:
awk 'BEGIN {FS = "_"} NR==FNR{a[$1];next} ($1 in a)' file1.txt file2.txt

NR==FNR - execute the next block for the 1st file only
a[$1] - create an associative array with the key $1 (field 1; use $0 to key on the whole line)
next - move to the next row
($1 in a) - for each line of the 2nd file, test whether the key is present in the array a:
print the common lines with "($0 in a)' file1 file2"
or the lines unique to the 1st file with "!($0 in a)' file2 file1"
or the lines unique to the 2nd file with "!($0 in a)' file1 file2"

To check, I can merge the 3 output files (uniquefile1, uniquefile2 and commonFile1_and_2) and look for duplicated IDs; nothing should be printed here:
awk 'BEGIN {FS = "_"} {print $1}' all_output_merged.txt | sort -n | uniq -d

In order to print the duplicated lines, one way (here we have only one field, with numbers):
(this does not say how many times each string is found)
sort file1.txt | uniq -d

Or with AWK (it does not say how many times the string is found)
awk '{i=1;while(i <= NF){a[$(i++)]++}}END{for(i in a){if(a[i]>1){print i,a[i]}}}' file1.txt
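
To also see how many times each entry occurs (a small sketch):
sort file1.txt | uniq -c | sort -rn
the first column is the number of occurrences; lines with a count of 1 are not duplicated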

Add a header (insert a line) with awk (tab delimited, works ok on mac OS X)
If I have a tab-delimited file with SMILES strings, then the name and then some score,
to add the tab-delimited header (Smiles Name Score) at the top I can do:
awk 'BEGIN{printf("Smiles\tName\tScore\n");} {print;}' myfile_no_header > myfile_with_header
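
The same thing can be done with a simple shell group (a minimal sketch):
{ printf "Smiles\tName\tScore\n"; cat myfile_no_header; } > myfile_with_header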


--------------

Compare an old and a new list of SMILES and drugs - diff, or here sdiff, to see side-by-side differences, keeping only fields 1 and 5, where field 1 could be the SMILES string and field 5 the name of the molecule
sdiff -a -i -s -WB -w 300 <(awk '{print $1, "\t", $5}' old_database_drugs.tsv) <(awk '{print $1, "\t", $5}' last_database_drugs.tsv) > found_differences_old_new_files

-w changes the default width printed on the terminal; 300 is more than the default

-i ignore case, -W ignore all white space, -B ignore blank lines, -s suppress common lines, -a treat all files as text

Some double checking of the diff command just above

If I have one file with an old version of compounds in SMILES and a new one with some additional compounds here and there,
I can cat the two files and remove the duplicate lines
Something like this could be tried

This removes duplicate lines, ignoring case
awk '!seen[toupper($0)]++' file

If case is not a problem, remove duplicate lines
awk '!seen[$0]++' file

This instead shows only the extra (duplicated) lines
awk 'seen[$0]++' file

If there is extra white space between strings (tabs are not changed), remove duplicate lines
(this will not work if there is extra white space at the end of the last string, for instance)
awk '{gsub(/[ ]+/," ")}!seen[$0]++'

This normalizes extra white space and also tabs (warning: because the assignment is used as the pattern, it also drops empty lines and lines whose first field is 0)
awk '$1=$1' file

To combine the white space normalization with duplicate removal (the case-insensitive version is shown below):
cat file.txt | awk '$1=$1' | awk '!seen[$0]++'

For instance in file we have:
Aspirin 23
aspirin 23
vitamin CCCC
vitamin CCCC
vitamin CCCC
VITamin CCCC

Thus different spacing, and some strings are followed by a tab
This:
cat file.txt | awk '$1=$1' | awk '!seen[toupper($0)]++'

removes the duplicates and prints:
Aspirin 23
vitamin CCCC

--------------

Affinity

pIC50 = -log10(IC50), with the IC50 expressed in mol/L
IC50 = 10^(-pIC50)

If IC50 = 900 nM then
pIC50 = -log(900*10^-9) = 6.04
or
If IC50 = 0.9 microM
pIC50 = -log(0.9*10^-6) = 6.04

IC50 = 10^-6.04 = 0.000000912 M; if we need it in microM, then 0.000000912 * 10^6 gives 0.912 microM (or 912 nM)

If in a file I have my molecule names and affinity in for instance pIC50 but, bad luck, the software wants the values in IC50.
Thus, in file myfile.txt:


name,pIC50
chembl10090009,6.04
chembl10090009,7.04

Then, to do this conversion step by step with a quick and dirty but simple pipeline (to skip the header line I use NR>1):
awk 'NR>1 {print $0}' myfile.txt | awk 'BEGIN{FS=OFS=","} {$2=-$2}1' | awk -F"," 'BEGIN{print "name"",""IC50"}{print $1","(10^$2*10^6)}'
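
The same conversion can be done in a single awk call (a small sketch, assuming the pIC50 is in field 2 of the comma-separated file and the IC50 is wanted in microM):
awk -F"," 'NR==1 {print "name,IC50"; next} {printf "%s,%.3f\n", $1, 10^(-$2) * 10^6}' myfile.txt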

If I need pIC50 instead, I can do something like the following.
The log function of awk calculates the natural logarithm, not the base-10 logarithm, so to calculate the negative base-10 log for pIC50:


if the IC50 value is in column 16 in a file
awk -F"," '{a = -log($16)/log(10); printf("%0.4f\n", a)}' myfile.txt

If in field 1 I have an IC50 of 62.2222, I should get a pIC50 of about -1.7939
awk '{a = -log($1)/log(10); printf("%0.4f\n", a)}' myfile.txt gives: -1.7939

--------------


AWK and SED
An awk program is a sequence of statements of the form:
pattern { action }
pattern { action }

Each line of input is matched against each of the patterns. For each pattern that matches, the associated action is
executed. When all the patterns have been tested, the next line is fetched and the matching starts over.


Examples
awk 'NR == 1 {print $2}' mydatafile
This says select line one and do action {..}, here print field 2

awk 'length > 10' myfile
This prints every line that is longer than 10 characters

awk '{ $1 = log($1); print }' myfile
Replaces the first field of each line by its logarithm

awk '$2 ~ /A|B|C/' myfile
Prints all input lines with an A, B, or C in the second field

awk '$2 ~/0.1/' myfile > myoutput
Prints all lines with 0.1 in the second field and copies them to the file myoutput (note that the dot matches any character; use /0\.1/ for a literal dot)

awk '{print "end"; print $0}' myfile
This prints the word end before each line

awk '{print "\$" $0}' myfile
prints a $ in front of each line

sed 's/\$/\"/'
substitutes the first $ on each line by a double quote

sed 's/\> \<ID\>/\> \<ID_STRUCTURE\>/g' file_input.sdf > fileoutput.sdf
substitutes each > <ID> by > <ID_STRUCTURE>

This prints the last field of each line
awk '{ print $NF }' myfile

Print the last field of the last line
awk '{ field = $NF }; END{ print field }' myfile

Print every line with more than 4 fields
awk 'NF > 4' myfile

Print every line where the value of the last field is > 4
awk '$NF > 4' myfile

Some other examples
awk 'BEGIN {i=1; while (i<=10){ print i*i; i++}}'

awk 'BEGIN {col = 13; {print col}}'

awk 'BEGIN {lines=0} {lines++} END {print lines}' myfile
(somehow like wc -l)

Change the field separator:
if in myfile i have:
uuuu:kkkk:lllll5676
uuuu:kkkk:lllll8999
uuuu:kkkk:lllll00999

awk 'BEGIN {FS=":"} {print $2}' myfile
I force the separator to be : and I keep field 2

awk '$3 == 0 {print $1}' myfile1 myfile2
If field 3 = 0, print field 1 of my two files

awk '$2 > 0.5 {col = col +1} END {print col}'
prints the number of times field 2 is above 0.5

Print lines with the word "brian" in them:
awk '/brian/ { print $0 }' myfile

Print each input line preceded with a line number
print the heading which includes the name of the file
awk 'BEGIN { print "File:", FILENAME } { print NR, ":\t", $0 }' myfile

awk '{name = name $1} END {print name}'
Print all on one line

Insert 5 blank spaces at beginning of each line
awk '{sub(/^/, "     ");print}'

Substitute "foo" with "bar" EXCEPT for lines which contain "baz"
awk '!/baz/{gsub(/foo/, "bar")};{print}'

Change "scarlet" or "ruby" or "puce" to "red"
awk '{gsub(/scarlet|ruby|puce/, "red"); print}'

Remove duplicate, consecutive lines (emulates "uniq")
awk 'a !~ $0; {a=$0}'

Remove duplicate, nonconsecutive lines
awk '! a[$0]++' # most concise script
awk '!($0 in a) {a[$0];print}' # most efficient script

Print the first line
awk 'NR <2' test

Print the last line of a file (emulates "tail -1")
awk 'END{print}'

Print only lines which match regular expression (emulates "grep")
awk '/regex/'

Print only lines which do NOT match regex (emulates "grep -v")
awk '!/regex/'

Print section of file from regular expression to end of file
awk '/regex/,0'
awk '/regex/,EOF'

Print section of file based on line numbers (lines 8-12, inclusive)
awk 'NR==8,NR==12'

Print line number 52
awk 'NR==52'
awk 'NR==52 {print;exit}' # more efficient on large files

Print section of file between two regular expressions (inclusive)
awk '/Iowa/,/Montana/' # case sensitive

Delete ALL blank lines from a file (same as "grep '.' ")
awk NF myfile > myoutput
awk '/./'

awk '{n=5 ; print $n}' myinput
prints the fifth field in the input record

To list a file, but skip over all the blank lines at the start of the file, use the command:
awk "/[^ ]/ { copy=1 }; copy { print }" filename.ext

To list all the lines of a file (TMP.LOG), except those containing the string "Frame overlap", you can use the command:
awk "!/Frame overlap/" TMP.LOG

Adding a blank line after lines in a list
To add a blank line after all lines containing "util.h" (single quotes so that the shell does not expand $0):
awk '{ print $0; if ($0 ~ /util.h/) print "" }' TMP.TMP

expression: none, so the action is run for all lines
action: print $0; if ($0 ~ /util.h/) print "" -- print the line, then, if the line contains "util.h", print a blank line

to add a blank line at the end of a file:
awk '{print $0} END {print ""}' myfile.txt > myoutput.txt

Dealing with $ in SDF files
sed '
/\$\$\$\$/ {
N
/\n.*ISIS/ {
s/\$\$\$\$.*\n.*ISIS/$$$$ ISIS/
}
}'
(the $ are escaped in the match patterns so they are taken literally)

awk 'BEGIN {for (x=1; x<=50; ++x) {printf("%3d\n",x) >> "tfile"}}'
dumps the numbers from 1 to 50 into "tfile".

Output can also be "piped" into another utility with the "|" ("pipe") operator. One can pipe output to the "tr" ("translate") utility to convert it to upper-case:
awk 'BEGIN { print "this is a test"}' | tr "[a-z]" "[A-Z]"
yields:
THIS IS A TEST

To run AWK
-You can type: awk '{.....}' myinputfile > myoutputfile
or >> myoutputfile to append to an existing file

-You can put the script in a file and run: awk -f myscript myfile ...and there are many other ways around;
things can depend on the shell you are using (sh, csh, tcsh, bash...), so watch out for the behavior

-F fs
Sets the FS variable to fs (see section Specifying how Fields are Separated).
-f source-file
Indicates that the awk program is to be found in source-file instead of in the first non-option argument.
-v var=val
Sets the variable var to the value val before execution of the program begins. Such variable values are available inside the BEGIN rule (see below for a fuller explanation). The `-v' option can only set one variable, but you can use it more than once, setting another variable each time, like this: `-v foo=1 -v bar=2'.
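
For instance (a small sketch; the cutoff variable and scores.txt are only example names), to pass a threshold from the shell into awk:
awk -v cutoff=0.5 '$2 > cutoff' scores.txt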

ESSENTIAL SYNTAX
Arithmetic
Operator Type Meaning
+ Arithmetic Addition
- Arithmetic Subtraction
* Arithmetic Multiplication
/ Arithmetic Division
% Arithmetic Modulo
++ Arithmetic Increment
-- Arithmetic Decrement
^ Arithmetic Exponentiation
+= Assignment Plus equals
-= Assignment Minus equals
*= Assignment Multiply equals
/= Assignment Divide equals
%= Assignment Modulo equals
^= Assignment Exponentiation equals



awk 'BEGIN {OFS="\t"} {print $1,$2,$3,(($1+$2+$3)/3)}' IN > OUT
Will print out column 1, column 2, column 3, and the mean of the 3 columns

awk '{print ($1/66),($2/6430),($3/627)}' IN>OUT
Dividing each column by different numbers


awk '{printf "%.3f\t%.3f\t%.3f\n", ($1/66),($2/6430),($3/627)}' IN>OUT
printf to format with 3 decimal places and separate with tabs

Conditional expressions
Operator Meaning
== Is equal
!= Is not equal to
> Is greater than
>= Is greater than or equal to
< Is less than
<= Is less than or equal to


Regular Expression Operators
Operator Meaning
~ Matches
!~ Doesn't match

ex: word !~ /START/

AND and OR and not matching
&& and || and !

Built-in VARIABLES
Records and Fields
Awk input is divided into records terminated by a record separator. The default record separator is a newline, so by default awk processes its input a line at a time. The number of the current record is available in a variable named NR.
Each input record is considered to be divided into fields. Fields are normally separated by white space --blanks or tabs -- but the input field separator may be changed. Fields are referred to as $1, $2, and so forth, where $1 is the first field, and $0 is the whole input record itself. Fields may be assigned too. The number of fields in the current record is available in a variable named NF.
The variables FS and RS refer to the input field and record separators; they may be changed at any time to any single character. The optional command-line argument -Fc may also be used to set FS to the character c.
The variable FILENAME contains the name of the current input file.

print $0 prints the full line; if the line has 8 fields, it is equivalent to:
print $1, $2, $3, $4, $5, $6, $7, $8

FS - The Input Field Separator
The input field separator, a blank by default

FNR
The input record number in the current input file

OFS - The Output Field Separator
The output field separator, a blank by default

ORS - The Output line record Separator
The output record separator, by default a newline

NF - The Number of Fields
Awk counts the number of fields in the input line and put it into a variable called NF.
awk '{print NF, $NF}' myinput
print the number of field and the last field of each line

NR - The Number of Records - the current input line number
Awk counts the number of lines it reads
awk '{print NR, $0}'
This prints the line number and the complete line

RS - The input line Record Separator (default = newline)
The input record separator, by default a newline

Change the RS

'BEGIN {
# change the record separator from newline to nothing
RS=""
# change the field separator from whitespace to newline
FS="n"
}
{
# print the second and third line of the file
print $2, $3}' myfile

if myfile is:
50.211 14.979 24.196
50.142 15.162 25.415

awk 'BEGIN {RS=""; FS="\n"} {print $2}' myfile
gives:
50.142 15.162 25.415
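
This paragraph mode (RS="") is handy whenever records are separated by blank lines; for instance (a small sketch on any file with blank-line-separated blocks), to print only the first line of each block:
awk 'BEGIN {RS=""; FS="\n"} {print $1}' myfile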

Arrays
awk provides single dimensioned arrays. Arrays need not be declared, they are created in the same manner as awk user defined variables.
Elements can be specified as numeric or string values.


Length
this counts the number of characters in a string
awk '{print length($0)}' myfile


Print and Printf
The print statement does output with simple, standardized formatting. You specify only the strings or numbers to be printed, in a list separated by commas. They are output, separated by single spaces, followed by a newline. The statement looks like this:
print item1 , item2 , ...

The simple statement print with no items is equivalent to print $0
it prints the entire current record. To print a blank line, use 'print ""', where "" is the null, or empty, string.

Using printf Statements for Fancier Printing
A format specifier starts with the character % and ends with a format-control letter; it tells the printf statement how to output one item. The format-control letter specifies what kind of value to print. The rest of the format specifier is made up of optional modifiers which are parameters such as the field width to use.

Here is a list of the format-control letters:
c This prints a number as an ASCII character. Thus, 'printf "%c", 65' outputs the letter A. The output for a string value is the first character of the string.
d This prints a decimal integer.
i This also prints a decimal integer.
e This prints a number in scientific (exponential) notation. For example,
printf "%4.3e", 1950

prints 1.950e+03, with a total of four significant figures of which three follow the decimal point. The 4.3 are modifiers, discussed below.
f This prints a number in floating point notation.
g This prints a number in either scientific notation or floating point notation, whichever uses fewer characters.
o This prints an unsigned octal integer.
s This prints a string.
x This prints an unsigned hexadecimal integer.

% This isn't really a format-control letter, but it does have a meaning when used after a %: the sequence `%%' outputs one '%'. It does not consume an argument.

A format specification can also include modifiers that can control how much of the item's value is printed and how much space it gets. The modifiers come between the '%' and the format-control letter. Here are the possible modifiers, in the order in which they may appear:
'-'
The minus sign, used before the width modifier, says to left-justify the argument within its specified width. Normally the argument is printed right-justified in the specified width. Thus,
printf "%-4s", "foo"

prints 'foo '.

'width'
This is a number representing the desired width of a field. Inserting any number between the '%' sign and the format control character forces the field to be expanded to this width. The default way to do this is to pad with spaces on the left. For example,
printf "%4s", "foo"

prints ' foo'. The value of width is a minimum width, not a maximum. If the item value requires more than width characters, it can be as wide as necessary. Thus,
printf "%4s", "foobar"

prints 'foobar'. Preceding the width with a minus sign causes the output to be padded with spaces on the right, instead of on the left.
'.prec'
This is a number that specifies the precision to use when printing. This specifies the number of digits you want printed to the right of the decimal point. For a string, it specifies the maximum number of characters from the string that should be printed.

The C library printf's dynamic width and prec capability (for example, "%*.*s") is supported. Instead of supplying explicit width and/or prec values in the format string, you pass them in the argument list. For example:
w = 5
p = 3
s = "abcdefg"
printf "<%*.*s>\n", w, p, s


is exactly equivalent to
s = "abcdefg"
printf "<%5.3s>\n", s

Both programs output '<  abc>' (there are two spaces before "abc": the string is truncated to 3 characters and right-justified in a field of width 5).


PRINT AND PRINTF again
The simplest output statement is the by-now familiar "print" statement. There's not too much to it:

• "Print" by itself prints the input line.

• "Print" with one argument prints the argument.

• "Print" with multiple arguments prints all the arguments, separated by spaces (or other specified OFS) when the arguments are separated by commas, or concatenated when the arguments are separated by spaces.

* The "printf()" (formatted print) function is much more flexible, and trickier. It has the syntax:
printf(<string>,<expression list>)

The "string" can be a normal string of characters:
printf("Hi, there!")

This prints "Hi, there!" to the display, just like "print" would, with one slight difference: the cursor remains at the end of the text, instead of skipping to the next line, as it would with "print". A "newline" code ("\n") has to be added to force "printf()" to skip to the next line:
printf("Hi, there!\n")

So far, "printf()" looks like a step backward from "print", and if you use it to do dumb things like this, it is. However, "printf()" is useful when you want precise control over the appearance of the output.

The trick is that the string can contain format or "conversion" codes to control the results of the expressions in the expression list. For example, the following program:
BEGIN {x = 35; printf("x = %d decimal, %x hex, %o octal.\n",x,x,x)}

-- prints:
x = 35 decimal, 23 hex, 43 octal.

The format codes in this example include: "%d" (specifying decimal output), "%x" (specifying hexadecimal output), and "%o" (specifying octal output). The "printf()" function substitutes the three variables in the expression list for these format codes on output.

* The format codes are highly flexible and their use can be a bit confusing. The "d" format code prints a number in decimal format. The output is an integer, even if the number is a real, like 3.14159. Trying to print a string with this format code results in a "0" output. For example:
x = 35; printf("x = %d\n",x) yields: x = 35
x = 3.1415; printf("x = %d\n",x) yields: x = 3
x = "TEST"; printf("x = %d\n",x) yields: x = 0

* The "o" format code prints a number in octal format. Other than that, this format code behaves exactly as does the "%d" format specifier. For example:
awk 'BEGIN {x = 255; printf("x = %o\n",x)}' yields: x = 377

* The "x" format code prints a number in hexadecimal format. Other than that, this format code behaves exactly as does the "%d" format specifier. For example:
x = 197; printf("x = %x\n",x) yields: x = c5

* The "c" format code prints a character, given its numeric code. For example, the following statement outputs all the printable characters:
BEGIN {for (ch=32; ch<128; ch++) printf("%c %c\n",ch,ch+128)}

* The "s" format code prints a string. For example:
x = "jive"; printf("string = %s\n",x) yields: string = jive

* The "e" format code prints a number in exponential format, in the default format:
[-]D.DDDDDDe[+/-]DDD

For example:
x = 3.1415; printf("x = %e\n",x) yields: x = 3.141500e+000

* The "f" format code prints a number in floating-point format, in the default format:
[-]D.DDDDDD

For example:
x = 3.1415; printf("x = %f\n",x) yields: f = 3.141500

* The "g" format code prints a number in exponential or floating-point format, whichever is shortest.

* A numeric string may be inserted between the "%" and the format code to specify greater control over the output format. For example:
%3d
%5.2f
%08s
%-8.4s

This works as follows:

• The integer part of the number specifies the minimum "width", or number of spaces, the output will use, though the output may exceed that width if it is too long to fit.

• The fractional part of the number specifies either, for a string, the maximum number of characters to be printed; or, for floating-point formats, the number of digits to be printed to the right of the decimal point.

• A leading "-" specifies left-justified output. The default is right-justified output.

• A leading "0" specifies that the output be padded with leading zeroes to fill up the output field. The default is spaces.

For example, consider the output of a string:
x = "Baryshnikov"
printf("[%3s]\n",x) yields: [Baryshnikov]
printf("[%16s]\n",x) yields: [ Baryshnikov]
printf("[%-16s]\n",x) yields: [Baryshnikov ]
printf("[%.3s]\n",x) yields: [Bar]
printf("[%16.3s]\n",x) yields: [ Bar]
printf("[%-16.3s]\n",x) yields: [Bar ]
printf("[%016s]\n",x) yields: [00000Baryshnikov]
printf("[%-016s]\n",x) yields: [Baryshnikov ]

-- or an integer:
x = 312
printf("[%2d]\n",x) yields: [312]
printf("[%8d]\n",x) yields: [     312]
printf("[%-8d]\n",x) yields: [312     ]
printf("[%.1d]\n",x) yields: [312]
printf("[%08d]\n",x) yields: [00000312]
printf("[%-08d]\n",x) yields: [312     ]

-- or a floating-point number:
x = 251.673209
printf("[%2f]\n",x) yields: [251.673209]
printf("[%16f]\n",x) yields: [      251.673209]
printf("[%-16f]\n",x) yields: [251.673209      ]
printf("[%.3f]\n",x) yields: [251.673]
printf("[%16.3f]\n",x) yields: [         251.673]
printf("[%016.3f]\n",x) yields: [000000000251.673]

----------

The keywords BEGIN and END are used to perform specific actions before and after reading the input lines. The BEGIN keyword is normally associated with printing titles and setting default values, whilst the END keyword is normally associated with printing totals

awk 'BEGIN {string = "Super" "power"; print string}'
this will print: Superpower

For example, to extract and print the word "get" from "unforgettable":
BEGIN {print substr("unforgettable",6,3)}

Please be aware that the first character of the string is numbered "1", not "0". To extract a substring of at most ten characters, starting from position 6 of the first field variable, you use:
substr($1,6,10)

Escape sequences
Sequence Description
\b Backspace
\f Formfeed
\n Newline
\r Carriage Return
\t Horizontal tab
\" Double quote
\a The "alert" character; usually the ASCII BEL character
\v Vertical tab


example: awk '{ print $0 "\n"}' myfile
add a new empty line after each line

Regular Expressions
Pattern searching similar to grep and other unix utilities:
/386/
$1 ~ /386/
In regular expressions, the following symbols are metacharacters with special meanings.
\ ^ $ . [ ] * + ? ( ) |

^ matches the first character of a string
$ matches the last character of a string
. matches a single character of a string
[ ] defines a set of characters
( ) used for grouping
| specifies alternatives

displays all which do not contain 2, 3, 4, 6 or 8 in first field
awk '$1 ~ /[^23468]/ { print $0 }'


How to hide special characters from the shell, this depends on the shell !
Preceding any single character with a backslash ('\') quotes that character.

Thus:
awk "BEGIN { print \"Don't Panic!\" }"
you get
tcsh: Unmatched '

but if you use bash, it works
With tcsh you need to write this:
awk 'BEGIN { print "Here is a single quote '\''" }'
the result is:
Here is a single quote '


Regular expressions are the extended kind found in egrep. They are composed of characters as follows:
c
matches the character c (assuming c is a character with no special meaning in regexps).
\c
matches the literal character c.
.
matches any character except newline.
^
matches the beginning of a line or a string.
$
matches the end of a line or a string.
[abc...]
matches any of the characters abc... (character class).
[^abc...]
matches any character except abc... and newline (negated character class).
r1|r2
matches either r1 or r2 (alternation).
r1r2
matches r1, and then r2 (concatenation).
r+
matches one or more r's.
r*
matches zero or more r's.
r?
matches zero or one r's.
(r)
matches r (grouping).


* The simplest kind of search pattern that can be specified is a simple string, enclosed in forward-slashes ("/"). For example:
/The/

-- searches for any line that contains the string "The". This will not match "the" as Awk is "case-sensitive", but it will match words like "There" or "Them".

This is the crudest sort of search pattern. Awk defines special characters or "metacharacters" that can be used to make the search more specific. For example, preceding the string with a "^" tells Awk to search for the string at the beginning of the input line. For example:
/^The/

-- matches any line that begins with the string "The". Similarly, following the string with a "$" matches any line that ends with "The", for example:
/The$/

But what if you actually want to search the text for a character like "^" or "$"? Simple, just precede the character with a backslash ("\"). For example:
/\$/

-- matches any line with a "$" in it.

* Such a pattern-matching string is known as a "regular expression". There are many different characters that can be used to specify regular expressions. For example, it is possible to specify a set of alternative characters using square brackets ("[]"):
/[Tt]he/

This example matches the strings "The" and "the". A range of characters can also be specified. For example:
/[a-z]/

-- matches any character from "a" to "z", and:
/[a-zA-Z0-9]/

-- matches any letter or number.

A range of characters can also be excluded, by preceding the range with a "^". For example:
/^[^a-zA-Z0-9]/

-- matches any line that doesn't start with a letter or digit.

A "|" allows regular expressions to be logically ORed. For example:
/(^Germany)|(^Netherlands)/

-- matches lines that start with the word "Germany" or the word "Netherlands". Notice how parentheses are used to group the two expressions.

* The "." special characters allows "wildcard" matching, meaning it can be used to specify any arbitrary character. For example:
/wh./

-- matches "who", "why", and any other string that has the characters "wh" and any following character.

This use of the "." wildcard should be familiar to UN*X shell users, but awk interprets the "*" wildcard in a subtly different way. In the UN*X shell, the "*" substitutes for a string of arbitrary characters of any length, including zero, while in awk the "*" simply matches zero or more repetitions of the previous character or expression. For example, "a*" would match "a", "aa", "aaa", and so on. That means that ".*" will match any string of characters.

There are other characters that allow matches against repeated characters expressions. A "?" matches zero or one occurrences of the previous regular expression, while a "+" matches one or more occurrences of the previous regular expression. For example:
/^[+-]?[0-9]+$/

-- matches any line that consists only of a (possibly signed) integer number. This is a somewhat confusing example and it is helpful to break it down by parts:
/^ Find string at beginning of line.
/^[-+]? Specify possible "-" or "+" sign for number.
/^[-+]?[0-9]+ Specify one or more digits "0" through "9".
/^[-+]?[0-9]+$/ Specify that the line ends with the number.


The search can be constrained to a single field within the input line. For example:
$1 ~ /^France$/

-- searches for lines whose first field ("$1" -- more on "field variables" later) is the word "France", while:
$1 !~ /^Norway$/

-- searches for lines whose first field is not the word "Norway".

It is possible to search for an entire series or "block" of consecutive lines in the text, using one search pattern to match the first line in the block and another search pattern to match the last line in the block. For example:
/^Ireland/,/^Summary/

-- matches a block of text whose first line begins with "Ireland" and whose last line begins with "Summary".

NF == 0

-- matches all blank lines, or those whose number of fields is zero.
$1 == "France"

-- is a string comparison that matches any line whose first field is the string "France". The astute reader may notice that this example seems to do the same thing as the previous example:
$1 ~ /^France$/

In fact, both examples do the same thing, but in the example immediately above the "^" and "$" metacharacters had to be used in the regular expression to specify a match with the entire first field; without them, it would match such strings as "FranceFour", "NewFrance", and so on. The string expression matches only to "France".

* It is also possible to combine several search patterns with the "&&" (AND) and "||" (OR) operators. For example:
((NR >= 30) && ($1 == "France")) || ($1 == "Norway")

-- matches any line past the 30th that begins with "France", or any line that begins with "Norway".

* One class of pattern-matching that wasn't listed above is performing a numeric comparison on a field variable. It can be done, of course; for example:
$1 == 100

-- matches any line whose first field has a numeric value equal to 100. This is a simple thing to do and it will work fine. However, suppose you want to perform:
$1 < 100

This will generally work fine, but there's a nasty catch to it, which requires some explanation. The catch is that if the first field of the input can be either a number or a text string, this sort of numeric comparison can give crazy results, matching on some text strings that aren't equivalent to a numeric value.

This is because awk is a "weakly-typed" language. Its variables can store a number or a string, with awk performing operations on each appropriately. In the case of the numeric comparison above, if $1 contains a numeric value, awk will perform a numeric comparison on it, as expected; but if $1 contains a text string, awk will perform a text comparison between the text string in $1 and the three-letter text string "100". This will work fine for a simple test of equality or inequality, since the numeric and string comparisons will give the same results, but it will give crazy results for a "less than" or "greater than" comparison.

Awk is not broken; it is doing what it is supposed to do in this case. If this problem comes up, it is possible to add a second test to the comparison to determine if the field contains a numeric value or a text string. This second test has the form:
(( $1 + 0 ) == $1 )

If $1 contains a numeric value, the left-hand side of this expression will add 0 to it, and awk will perform a numeric comparison that will always be true.

If $1 contains a text string that doesn't look like a number, for want of anything better to do awk will interpret its value as 0. This means the left-hand side of the expression will evaluate to zero; since there is a non-numeric text string in $1, awk will perform a string comparison that will always be false. This leads to a more workable comparison:
((( $1 + 0 ) == $1 ) && ( $1 > 100 ))
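
Put together as a one-liner (just a sketch on some myfile), this prints the lines whose first field is truly numeric and greater than 100:
awk '((($1 + 0) == $1) && ($1 > 100))' myfile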


AWK Numerical Functions
Name Function
cos(x) Cosine with x in radians
exp(x) Exponent
int(x) Integer part of x truncated towards 0
log(x) Logarithm (natural logarithm of x )
sin(x) Sine with x in radians
sqrt(x) Square Root
atan2(y,x) Arctangent of y/x in radians
rand() Random
srand(x) Seed Random


awk 'BEGIN { for (i = 1; i <= 7; i++) print int(101 * rand()) }'
This program prints 7 random numbers from 0 to 100, inclusive.

awk '{print sqrt($1)}' myfile
Print the square root for numbers in field 1

rand()
This gives you a random number. The values of rand are uniformly-distributed between 0 and 1. The value is never 0 and never 1. Often you want random integers instead. Here is a user-defined function you can use to obtain a random nonnegative integer less than n:
function randint(n) {
return int(n * rand())
}

The multiplication produces a random real number greater than 0 and less than n. We then make it an integer (using int) between 0 and n - 1. Here is an example where a similar function is used to produce random integers between 1 and n. Note that this program will print a new random number for each input record.
awk '
# Function to roll a simulated die.
function roll(n) { return 1 + int(rand() * n) }

# Roll 3 six-sided dice and print total number of points.
{
printf("%d points\n", roll(6)+roll(6)+roll(6))
}'

Note: rand starts generating numbers from the same point, or seed, each time you run awk. This means that a program will produce the same results each time you run it. The numbers are random within one awk run, but predictable from run to run. This is convenient for debugging, but if you want a program to do different things each time it is used, you must change the seed to a value that will be different in each run. To do this, use srand.
srand(x)
The function srand sets the starting point, or seed, for generating random numbers to the value x. Each seed value leads to a particular sequence of "random" numbers. Thus, if you set the seed to the same value a second time, you will get the same sequence of "random" numbers again. If you omit the argument x, as in srand(), then the current date and time of day are used for a seed. This is the way to get random numbers that are truly unpredictable. The return value of srand is the previous seed. This makes it easy to keep track of the seeds for use in consistently reproducing sequences of random numbers.
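
For instance, to get a different random integer between 0 and 100 at each run (a small sketch):
awk 'BEGIN { srand(); print int(101 * rand()) }'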

String Functions
index(string,search)
length(string)
split(string,array,separator)
substr(string,position)
substr(string,position,max)
sub(regex,replacement)
sub(regex,replacement,string)
gsub(regex,replacement)
gsub(regex,replacement,string)
match(string,regex)
tolower(string)
toupper(string)
system(cmd-line)
Execute the command cmd-line, and return the exit status

Example
The string function gsub to replace each occurrence of 286 with the string AT
awk '{ gsub( /286/, "AT" ); print $0 }' myfile

awk '{print tolower($0)}' myfile

If myfile contains:
50.211 14.979 24.196
50.142 15.162 25.415

awk '{split($0,a," "); print a[1]}' myfile
will give :
50.211
50.142

if I do this only on line 1:
awk 'NR==1 {split($0,a," "); print a[1]}' myfile
i get:
50.211

If the myfile contains:
Processing NGC 2345

awk '{print substr($0,12,8)}' myfile
will give: NGC 2345

The "split()" function has the syntax:
split(<string>,<array>,[<field separator>])

This function takes a string with n fields and stores the fields into array[1], array[2], ... , array[n]. If the optional field separator is not specified, the value of FS (normally "white space", the space and tab characters) is used. For example, suppose we have a field of the form:
joe:frank:harry:bill:bob:sil

We could use "split()" to break it up and print the names as follows:
my_string = "joe:frank:harry:bill:bob:sil";
split(my_string,names,":");
print names[1];
print names[2];
...

The "index()" function has the syntax:
index(<target string>,<search string>)

-- and returns the position at which the search string begins in the target string (remember, the initial position is "1"). For example:
index("gorbachev","bach") returns: 4
index("superficial","super") returns: 1
index("sunfire","fireball") returns: 0
index("aardvark","z") returns: 0

match(string, regexp)
The match function searches the string, string, for the longest, leftmost substring matched by the regular expression, regexp. It returns the character position, or index, of where that substring begins (1, if it starts at the beginning of string). If no match is found, it returns 0. The match function sets the built-in variable RSTART to the index. It also sets the built-in variable RLENGTH to the length in characters of the matched substring. If no match is found, RSTART is set to 0, and RLENGTH to -1. For example:
awk '{
if ($1 == "FIND")
regex = $2
else {
where = match($0, regex)
if (where)
print "Match of", regex, "found at", where, "in", $0
}
}'

This program looks for lines that match the regular expression stored in the variable regex. This regular expression can be changed. If the first word on a line is FIND, regex is changed to be the second word on that line. Therefore, given:
FIND fo*bar
My program was a foobar
But none of it would doobar
FIND Melvin
JF+KM
This line is property of The Reality Engineering Co.
This file created by Melvin.

awk prints:
Match of fo*bar found at 18 in My program was a foobar
Match of Melvin found at 22 in This file created by Melvin.

split(string, array, fieldsep)
This divides string into pieces separated by fieldsep, and stores the pieces in array. The first piece is stored in array[1], the second piece in array[2], and so forth. The string value of the third argument, fieldsep, is a regexp describing where to split string (much as FS can be a regexp describing where to split input records). If the fieldsep is omitted, the value of FS is used. split returns the number of elements created. The split function, then, splits strings into pieces in a manner similar to the way input lines are split into fields. For example:
split("auto-da-fe", a, "-")

splits the string `auto-da-fe' into three fields using `-' as the separator. It sets the contents of the array a as follows:
a[1] = "auto"
a[2] = "da"
a[3] = "fe"

The value returned by this call to split is 3. As with input field-splitting, when the value of fieldsep is " ", leading and trailing whitespace is ignored, and the elements are separated by runs of whitespace.
sprintf(format, expression1,...)
This returns (without printing) the string that printf would have printed out with the same arguments (see section Using printf Statements for Fancier Printing). For example:
sprintf("pi = %.2f (approx.)", 22/7)

returns the string "pi = 3.14 (approx.)".
sub(regexp, replacement, target)
The sub function alters the value of target. It searches this value, which should be a string, for the leftmost substring matched by the regular expression, regexp, extending this match as far as possible. Then the entire string is changed by replacing the matched text with replacement. The modified string becomes the new value of target. This function is peculiar because target is not simply used to compute a value, and not just any expression will do: it must be a variable, field or array reference, so that sub can store a modified value there. If this argument is omitted, then the default is to use and alter $0. For example:
str = "water, water, everywhere"
sub(/at/, "ith", str)

sets str to "wither, water, everywhere", by replacing the leftmost, longest occurrence of 'at' with 'ith'. The sub function returns the number of substitutions made (either one or zero). If the special character '&' appears in replacement, it stands for the precise substring that was matched by regexp. (If the regexp can match more than one string, then this precise substring may vary.) For example:
awk '{ sub(/candidate/, "& and his wife"); print }'

changes the first occurrence of 'candidate' to 'candidate and his wife' on each input line. Here is another example:
awk 'BEGIN {
str = "daabaaa"
sub(/a+/, "c&c", str)
print str
}'

prints 'dcaacbaaa'. This show how '&' can represent a non-constant string, and also illustrates the "leftmost, longest" rule. The effect of this special character ('&') can be turned off by putting a backslash before it in the string. As usual, to insert one backslash in the string, you must write two backslashes. Therefore, write '\\&' in a string constant to include a literal '&' in the replacement. For example, here is how to replace the first `|' on each line with an '&':
awk '{ sub(/\|/, "\\&"); print }'

Note: as mentioned above, the third argument to sub must be an lvalue. Some versions of awk allow the third argument to be an expression which is not an lvalue. In such a case, sub would still search for the pattern and return 0 or 1, but the result of the substitution (if any) would be thrown away because there is no place to put it. Such versions of awk accept expressions like this:
sub(/USA/, "United States", "the USA and Canada")

But that is considered erroneous in gawk.
gsub(regexp, replacement, target)
This is similar to the sub function, except gsub replaces all of the longest, leftmost, nonoverlapping matching substrings it can find. The 'g' in gsub stands for "global," which means replace everywhere. For example:
awk '{ gsub(/Britain/, "United Kingdom"); print }'

replaces all occurrences of the string 'Britain' with 'United Kingdom' for all input records. The gsub function returns the number of substitutions made. If the variable to be searched and altered, target, is omitted, then the entire input record, $0, is used. As in sub, the characters '&' and '\' are special, and the third argument must be an lvalue.
substr(string, start, length)
This returns a length-character-long substring of string, starting at character number start. The first character of a string is character number one. For example, substr("washington", 5, 3) returns "ing". If length is not present, this function returns the whole suffix of string that begins at character number start. For example, substr("washington", 5) returns "ington". This is also the case if length is greater than the number of characters remaining in the string, counting from character number start.
tolower(string)
This returns a copy of string, with each upper-case character in the string replaced with its corresponding lower-case character. Nonalphabetic characters are left unchanged. For example, tolower("MiXeD cAsE 123") returns "mixed case 123".
toupper(string)
This returns a copy of string, with each lower-case character in the string replaced with its corresponding upper-case character. Nonalphabetic characters are left unchanged. For example, toupper("MiXeD cAsE 123") returns "MIXED CASE 123".


The getline function
getline
getline <file
getline variable
getline variable <file
awk provides the function getline to read input from the current input file or from a file or pipe.
getline reads the next input line, splits it into fields, and updates NF, NR and FNR. It returns 1 for success, 0 for end-of-file, and -1 on error.
The statement
getline < "temp.dat"
reads the next input line from the file "temp.dat", field splitting is performed, and NF is set.

The statement
getline data < "temp.dat"
reads the next input line from the file "temp.dat" into the user defined variable data, no field splitting is done, and NF, NR and FNR are not altered.
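
For instance (a small sketch; names.txt is only an example file), to count the lines of another file from inside an awk program:
awk 'BEGIN { while ((getline line < "names.txt") > 0) n++; print n, "lines in names.txt" }'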

You can take input from keyboard while running awk script, try the following awk script:
awk 'BEGIN {print "your name"; getline na <"-"; print "my name is " na}'
Here getline function is used to read input from keyboard and then assign the data (inputted from keyboard) to variable.
Syntax:
getline variable-name < "-"

1 --> getline is the function name
2 --> variable-name receives the value read from the input
3 --> "-" means read from stdin (the keyboard)


Function Definition Example
Here is an example of a user-defined function, called myprint, that takes a number and prints it in a specific format.
function myprint(num)
{
printf "%6.3g\n", num
}


To illustrate, here is an awk rule which uses our myprint function:
$3 > 0 { myprint($3) }

This program prints, in our special format, all the third fields that contain a positive number in our input. Therefore, when given:
1.2 3.4 5.6 7.8
9.10 11.12 -13.14 15.16
17.18 19.20 21.22 23.24


this program, using our function to format the results, prints:
5.6
21.2


Here is an example of a recursive function. It prints a string backwards:
function rev (str, len) {
if (len == 0) {
printf "\n"
return
}
printf "%c", substr(str, len, 1)
rev(str, len - 1)
}
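
To actually use it, the function can be placed in the same program and called on every input line, for instance (myfile being any text file):
awk '
function rev (str, len) {
if (len == 0) {
printf "\n"
return
}
printf "%c", substr(str, len, 1)
rev(str, len - 1)
}
{ rev($0, length($0)) }' myfile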

Calling User-defined Functions
Calling a function means causing the function to run and do its job. A function call is an expression, and its value is the value returned by the function.
A function call consists of the function name followed by the arguments in parentheses. What you write in the call for the arguments are awk expressions; each time the call is executed, these expressions are evaluated, and the values are the actual arguments. For example, here is a call to foo with three arguments (the first being a string concatenation):
foo(x y, "lose", 4 * z)

Caution: whitespace characters (spaces and tabs) are not allowed between the function name and the open-parenthesis of the argument list. If you write whitespace by mistake, awk might think that you mean to concatenate a variable with an expression in parentheses. However, it notices that you used a function name and not a variable name, and reports an error.

When a function is called, it is given a copy of the values of its arguments. This is called call by value. The caller may use a variable as the expression for the argument, but the called function does not know this: it only knows what value the argument had. For example, if you write this code:
foo = "bar"
z = myfunc(foo)

then you should not think of the argument to myfunc as being "the variable foo." Instead, think of the argument as the string value, "bar".

If the function myfunc alters the values of its local variables, this has no effect on any other variables. In particular, if myfunc does this:
function myfunc (win) {
print win
win = "zzz"
print win
}

to change its first argument variable win, this does not change the value of foo in the caller. The role of foo in calling myfunc ended when its value, "bar", was computed. If win also exists outside of myfunc, the function body cannot alter this outer value, because it is shadowed during the execution of myfunc and cannot be seen or changed from there.

However, when arrays are the parameters to functions, they are not copied. Instead, the array itself is made available for direct manipulation by the function. This is usually called call by reference. Changes made to an array parameter inside the body of a function are visible outside that function. This can be very dangerous if you do not watch what you are doing. For example:
function changeit (array, ind, nvalue) {
array[ind] = nvalue
}

BEGIN {
a[1] = 1 ; a[2] = 2 ; a[3] = 3
changeit(a, 2, "two")
printf "a[1] = %s, a[2] = %s, a[3] = %s\n", a[1], a[2], a[3]
}


prints 'a[1] = 1, a[2] = two, a[3] = 3', because calling changeit stores "two" in the second element of a.


The return Statement
The body of a user-defined function can contain a return statement. This statement returns control to the rest of the awk program. It can also be used to return a value for use in the rest of the awk program. It looks like this:
return expression

The expression part is optional. If it is omitted, then the returned value is undefined and, therefore, unpredictable.

A return statement with no value expression is assumed at the end of every function definition. So if control reaches the end of the function body, then the function returns an unpredictable value. awk will not warn you if you use the return value of such a function; you will simply get unpredictable or unexpected results.

Here is an example of a user-defined function that returns a value for the largest number among the elements of an array:
function maxelt (vec,    i, ret) {
for (i in vec) {
if (ret == "" || vec[i] > ret)
ret = vec[i]
}
return ret
}


You call maxelt with one argument, which is an array name. The local variables i and ret are not intended to be arguments; while there is nothing to stop you from passing two or three arguments to maxelt, the results would be strange. The extra space before i in the function parameter list is to indicate that i and ret are not supposed to be arguments. This is a convention which you should follow when you define functions.

Here is a program that uses our maxelt function. It loads an array, calls maxelt, and then reports the maximum number in that array:
awk '
function maxelt (vec,    i, ret) {
for (i in vec) {
if (ret == "" || vec[i] > ret)
ret = vec[i]
}
return ret
}

# Load all fields of each record into nums.
{
for(i = 1; i <= NF; i++)
nums[NR, i] = $i
}

END {
print maxelt(nums)
}'


Given the following input:
1 5 23 8 16
44 3 5 2 8 26
256 291 1396 2962 100
-6 467 998 1101
99385 11 0 225

the program tells us that:
99385
is the largest number in our array.

awk Control Flow Statements
if ( expression ) statement1 else statement2

while ( expression ) statement

for ( expression1; expression; expression2 ) statement


The syntax of "if ... else" is:
if (<condition>) <action 1> [else <action 2>]

The "else" clause is optional. The "condition" can be any expression discussed in the section on pattern matching, including matches with regular expressions.
For example, consider the following Awk program:
{if ($1=="green") print "GO";
else if ($1=="yellow") print "SLOW DOWN";
else if ($1=="red") print "STOP";
else print "WHAT";}


The syntax for "while" is:
while (<condition>) <action>

The "action" is performed as long the "condition" tests true, and the "condition" is tested before each iteration. The conditions are the same as for the "if ... else" construct. For example, since by default an Awk variable has a value of 0, the following Awk program could print the numbers from 1 to 20:
BEGIN {while(++x<=20) print x}

* The "for" loop is more flexible. It has the syntax:
for (<initial action>;<condition>;<end-of-loop action>) <action>

For example, the following "for" loop prints the numbers 10 through 20 in increments of 2:
BEGIN {for (i=10; i<=20; i+=2) print i}

This is equivalent to:
i=10
while (i<=20) {
print i;
i+=2;}

The "for" loop has an alternate syntax, used when scanning through an array:
for (<variable> in <array>) <action>

with the example:
my_string = "joe:frank:harry:bill:bob:sil";
split(my_string, names, ":");

-- then the names could be printed with the following statement:
for (idx in names) print idx, names[idx];

This yields:
2 frank
3 harry
4 bill
5 bob
6 sil
1 joe

Notice that the names are not printed in the proper order. One of the characteristics of this type of "for" loop is that the array is not scanned in a predictable order.


Awk defines four unconditional control statements: "break", "continue", "next", and "exit". "Break" and "continue" are strictly associated with the "while" and "for" loops:

• break: Causes a jump out of the loop.

• continue: Forces the next iteration of the loop.

"Next" and "exit" control Awk's input scanning:

• next: Causes Awk to immediately get another line of input and begin scanning it from the first match statement.

• exit: Causes Awk to end reading its input and execute END operations, if any are specified.


Limits
Each implementation of awk imposes some limits. Below are typical limits of the original awk (modern implementations such as gawk are much less restrictive):
100 fields
2500 characters per input line
2500 characters per output line
1024 characters per individual field
1024 characters per printf string
400 characters maximum quoted string
400 characters in character class
15 open files
1 pipe


EXAMPLES
There are millions of different ways to do things... here are a few examples


The simplest action is to print some or all of a record; this is accomplished by the awk command print.
The awk program

awk '{ print }' myfile
Prints each record
while
awk '{print $2, $1}' myfile
prints the first two fields in reverse order but
awk '{print $1 $2}' myfile
will concatenate the 2 fields with no separator (there is no comma between them)

awk '{ print $1 >"foo1"; print $2 >"foo2" }' myfile
will put the data into file foo1 and file foo2


The variables OFS and ORS may be used to change the current output field separator and output record separator. The output record separator is appended to the output of the print statement.
Awk also provides the printf statement for output formatting.

BEGIN and END
The special pattern BEGIN matches the beginning of the input, before the first record is read. The pattern END matches the end of the input, after the last record has been processed. BEGIN and END thus provide a way to gain control before and after processing.


-------

awk '{ if ($2 =="0.5") {print $0} }' myfile
prints the lines for which field 2 = 0.5

Test count things:
awk 'BEGIN {counter = 0} {if ($2 == "0.5"){counter++}} END {print counter} ' myfile
this tells me how many times field 2 has a value of 0.5



Using Awk to create a simple histogram
We have a file with scores in a file called mydata
r 0.2 99
r 0.1 88
r 0.4 76
r 0.1 76
r 0.2 56
r 0.3 900
r 0.2 43
r 0.5 5
r 0.5 9
r 0.6 56
r 0.8 43
r 0.7 33
r 0.9 10



we can sort on the second column with:

sort -k2 -n mydata > mydata_sorted
(the old syntax "sort +1 -n" is no longer accepted by recent versions of sort)

this gives:
r 0.1 76
r 0.1 88
r 0.2 43
r 0.2 56
r 0.2 99
r 0.3 900
r 0.4 76
r 0.5 5
r 0.5 9
r 0.6 56
r 0.7 33
r 0.8 43
r 0.9 10


(sorting can be done in descending (reverse) order with sort -nr)


You can put the following lines in a file called histo.txt
to print a frequency histogram of the numbers in column 2 (field 2)
$2 <= 0.1 {na=na+1}
($2 > 0.1) && ($2 <= 0.2) {nb = nb+1}
($2 > 0.2) && ($2 <= 0.3) {nc = nc+1}
($2 > 0.3) && ($2 <= 0.4) {nd = nd+1}
($2 > 0.4) && ($2 <= 0.5) {ne = ne+1}
($2 > 0.5) && ($2 <= 0.6) {nf = nf+1}
($2 > 0.6) && ($2 <= 0.7) {ng = ng+1}
($2 > 0.7) && ($2 <= 0.8) {nh = nh+1}
($2 > 0.8) && ($2 <= 0.9) {ni = ni+1}
($2 > 0.9) {nj = nj+1}
END {print na, nb, nc, nd, ne, nf, ng, nh, ni, nj, NR}

and run

awk -f histo.txt mydata_sorted
this will give:
2 3 1 1 2 1 1 1 1  13
meaning: values <= 0.1 occur twice, values in (0.1, 0.2] three times, ... values in (0.8, 0.9] once; the last bin (> 0.9) is empty, and the final 13 is the total number of lines (NR)
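
A quick frequency table of the exact values in field 2 can also be obtained with an associative array (a small sketch):
awk '{ count[$2]++ } END { for (v in count) print v, count[v] }' mydata | sort -n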


COUNTING score values after docking

awk 'BEGIN{
counter = 0
}
{
if ($3 >= 0 && $3 <= 1 ){
counter++
}
}
END{
print "Scores_0_to_1 " counter
}' mylistwithnameshort_test.txt

awk 'BEGIN{
counter = 0
}
{
if ($3 >= 1 && $3 <= 2 ){
counter++
}
}
END{
print "Scores_1_to_2 " counter
}' mylistwithnameshort_test.txt

awk 'BEGIN{
counter = 0
}
{
if ($3 >= 2 && $3 <= 3 ){
counter++
}
}
END{
print "Scores_2_to_3 " counter
}' mylistwithnameshort_test.txt

awk 'BEGIN{
counter = 0
}
{
if ($3 >= 3 && $3 <= 4 ){
counter++
}
}
END{
print "Scores_3_to_4 " counter
}' mylistwithnameshort_test.txt

awk 'BEGIN{
counter = 0
}
{
if ($3 >= 4 && $3 <= 5 ){
counter++
}
}
END{
print "Scores_4_to_5 " counter
}' mylistwithnameshort_test.txt


awk 'BEGIN{
counter = 0
}
{
if ($3 >= 5 && $3 <= 6 ){
counter++
}
}
END{
print "Scores_5_to_6 " counter
}' mylistwithnameshort_test.txt

awk 'BEGIN{
counter = 0
}
{
if ($3 >= 6 && $3 <= 7 ){
counter++
}
}
END{
print "Scores_6_to_7 " counter
}' mylistwithnameshort_test.txt

awk 'BEGIN{
counter = 0
}
{
if ($3 >= 7 && $3 <= 8 ){
counter++
}
}
END{
print "Scores_7_to_8 " counter
}' mylistwithnameshort_test.txt

awk 'BEGIN{
counter = 0
}
{
if ($3 >= 8 && $3 <= 9 ){
counter++
}
}
END{
print "Scores_8_to_9 " counter
}' mylistwithnameshort_test.txt

awk 'BEGIN{
counter = 0
}
{
if ($3 >= 9 && $3 <= 10 ){
counter++
}
}
END{
print "Scores_9_to_10 " counter
}' mylistwithnameshort_test.txt

awk 'BEGIN{
counter = 0
}
{
if ($3 >= 10 && $3 <= 11 ){
counter++
}
}
END{
print "Scores_10_to_11 " counter
}' mylistwithnameshort_test.txt

awk 'BEGIN{
counter = 0
}
{
if ($3 >= 11 && $3 <= 12 ){
counter++
}
}
END{
print "Scores_11_to_12 " counter
}' mylistwithnameshort_test.txt

awk 'BEGIN{
counter = 0
}
{
if ($3 >= 12 && $3 <= 13 ){
counter++
}
}
END{
print "Scores_12_to_13 " counter
}' mylistwithnameshort_test.txt

awk 'BEGIN{
counter = 0
}
{
if ($3 >= 13 && $3 <= 14 ){
counter++
}
}
END{
print "Scores_13_to_14 " counter
}' mylistwithnameshort_test.txt




awk 'BEGIN{
counter = 0
}
{
if ($3 >= 14 && $3 <= 15 ){
counter++
}
}
END{
print "Scores_14_to_15 " counter
}' mylistwithnameshort_test.txt



awk 'BEGIN{
counter = 0
}
{
if ($3 >= 15 && $3 <= 20 ){
counter++
}
}
END{
print "Scores_15_to_20 " counter
}' mylistwithnameshort_test.txt
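
All the bins can also be counted in a single pass with an awk array (a sketch; here the bins are one unit wide and half-open, [n, n+1), assuming non-negative scores in field 3):
awk '{b = int($3); n[b]++; if (b > max) max = b} END {for (i = 0; i <= max; i++) printf "Scores_%d_to_%d %d\n", i, i+1, n[i]+0}' mylistwithnameshort_test.txt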


-------


echo "the script starts"
echo "check that each compound starts with ISIS"
echo -n "select the SDF file:"
read file1
echo "the name of the file is $file1"
tr '\r' '\n' < "$file1" > "tmp1_unix.sdf"
echo "step 1"
awk 'NF > 0' < "tmp1_unix.sdf" > "tmp2_unix_no_emptylines.sdf"
echo "this is done"

Word frequency
Print one word per line
xargs -n 1 < toto.txt
or
awk 'BEGIN{RS=" "} 1' myfile
or
awk -v OFS='\n' '{$1=$1}1' myfile

Then, something like this:
cat myfile | xargs -n 1 | tr '[:upper:]' '[:lower:]' | sort | uniq -c | sort -n
or
cat myfile | xargs -n 1 | tr '[A-Z]' '[a-z]' | sort | uniq -c | sort -n


Rename files
#!/bin/sh
# if we have fewer than 3 arguments, print the help text:
if [ $# -lt 3 ] ; then
cat <<HELP
ren -- renames a number of files using sed regular expressions

USAGE: ren 'regexp' 'replacement' files...

EXAMPLE: rename all *.HTM files to *.html:
ren 'HTM$' 'html' *.HTM

HELP
exit 0
fi
OLD="$1"
NEW="$2"
# The shift command removes one argument from the list of
# command line arguments.
shift
shift
# "$@" now contains all the files (quoted, so names with spaces survive):
for file in "$@"; do
if [ -f "$file" ] ; then
newfile=`echo "$file" | sed "s/${OLD}/${NEW}/g"`
if [ -f "$newfile" ]; then
echo "ERROR: $newfile exists already"
else
echo "renaming $file to $newfile ..."
mv "$file" "$newfile"
fi
fi
done


Rename files
This only prints the mv commands so they can be checked first (see the pipe-to-sh example below to actually run them)
for i in toto* ; do
echo "mv $i ${i}.smi"
done
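
Once the echoed commands look right, they can be piped to sh to actually do the renaming (a sketch):
for i in toto* ; do echo "mv $i ${i}.smi" ; done | sh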

Some other examples
# Print first two fields in opposite order:
awk '{ print $2, $1 }' file

# Print lines longer than 72 characters:
awk 'length > 72' file

# Print length of string in 2nd column
awk '{print length($2)}' file

# Add up first column, print sum and average:
{ s += $1 }
END { print "sum is", s, " average is", s/NR }

# Print fields in reverse order:
awk '{ for (i = NF; i > 0; --i) print $i }' file

# Print the last line
{line = $0}
END {print line}

# Print the total number of lines that contain the word Pat
/Pat/ {nlines = nlines + 1}
END {print nlines}

# Print all lines between start/stop pairs:
awk '/start/, /stop/' file

# Print all lines whose first field is different from previous one:
awk '$1 != prev { print; prev = $1 }' file

# Print column 3 if column 1 > column 2:
awk '$1 > $2 {print $3}' file

# Print line if column 3 > column 2:
awk '$3 > $2' file

# Count number of lines where col 3 > col 1
awk '$3 > $1 {n++} END {print n+0}' file

# Print sequence number and then column 1 of file:
awk '{print NR, $1}' file

# Print every line after erasing the 2nd field
awk '{$2 = ""; print}' file

# Print hi 28 times
yes | head -28 | awk '{ print "hi" }'

# Print hi.0010 to hi.0099 (NOTE IRAF USERS!)
yes | head -90 | awk '{printf("hi00%2.0f \n", NR+9)}'

# Print out 4 random numbers between 0 and 1
yes | head -4 | awk '{print rand()}'

# Print out 40 random integers modulo 5
yes | head -40 | awk '{print int(100*rand()) % 5}'


# Replace every field by its absolute value
{ for (i = 1; i <= NF; i++) if ($i < 0) $i = -$i; print }

# If you have another character that delimits fields, use the -F option
# For example, to print out the phone number for Jones in the following file,
# 000902|Beavis|Theodore|333-242-2222|149092
# 000901|Jones|Bill|532-382-0342|234023
# ...
# type
awk -F"|" '$2=="Jones"{print $4}' filename

# Some looping commands
# Remove a bunch of print jobs from the queue
BEGIN{
for (i=875;i>833;i--){
printf "lprm -Plw %d\n", i
} exit
}

Formatted printouts are of the form printf("format\n", value1, value2, ..., valueN)
e.g. printf("howdy %-8s What it is bro. %.2f\n", $1, $2*$3)
The main format codes (a small illustration follows this list):
%s = string
%-8s = 8 character string left justified
%.2f = number with 2 places after .
%6.2f = field 6 chars with 2 chars after .
\n is newline
\t is a tab
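
For instance (a small illustration, with a made-up name and value):
echo "ethanol 1.23456" | awk '{printf "%-8s|%6.2f\n", $1, $2}'
prints: ethanol |  1.23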

---------
Align columns separated by multiple white spaces with column (-t builds a table from whitespace-delimited input, -s specifies the set of characters used to delimit columns for the -t option)
Not fully OK on Mac
column -t myfile.txt

If there are several white spaces between columns, awk can also squeeze them:
awk '{$2=$2};1' myfile.txt

Then change single spaces to tabs (on a Mac, run the squeeze step above first, as there are some problems with sed, awk etc, unless using gsed etc.)
awk '{gsub(" ","\t",$0); print;}'

At the end something like this:
awk '{$2=$2};1' TEST-space.txt | awk '{gsub(" ","\t",$0); print;}'


Write PubMed results reverse order
Reverse the order of blocks of text separated by empty lines - warning: this needs tac (the reverse of cat)
On Mac it comes with coreutils as gtac (to install: brew install coreutils)
gtac TEST-pubmed.txt | perl -00 -lpe '$_ = join "\n", reverse split /\n/' > toto_reverse.txt

Remove the leading numbering (digits) in the pubmed_result file (on a Mac it seems to require applying sed twice)
cat pubmed_result.txt | sed 's/^[[:digit:]]\://g' | sed 's/^[0-9].://g'

merge block of lines into one line
awk '{$1=$1; sub(/$/,""); print}' RS= toto_reverse.txt > toutou.txt

insert empty line in between each line
sed 'G' toutou.txt > tutu.txt

delete first field meaning the numbering of each paper
awk '{$1=""; print $0}' tutu.txt > zozo.txt

add number in front of each line (default, number and tab text)
nl zozo.txt > zozo_end.txt

remove from DOI to end of line
sed 's/doi:.*$//g' zozo_end.txt > zozo_no_DOI.txt

Or one step less with all combined (of course can be done in less steps, this is also to illustrate different small tricks):
cat pubmed_result.txt | awk '{$1=$1; sub(/$/,""); print}' RS= | gtac | sed 'G' | awk '{$1=""; print $0}' | nl | sed 's/doi:.*$//g' > PUBmed_result_Reversed.txt

Unit conversion etc
The file has 3 fields (remove headers to avoid problems)
name solubility(mg/L) MW

To deal with this solubility question, we do it step by step
divide by 1000 to get g/L and divide by MW to get mol/L

Step 1, divide by 1000:
cat file | awk -v OFS='\t' '{print $1, $2/1000, $3}' > test1.txt

Step 2, divide field 2 by field 3 (the MW) and create field 4, separated by a tab
awk -v OFS='\t' '{$4 = $2 / $3}1' myfile

Here
cat file | awk -v OFS='\t' '{print $1, $2/1000, $3}' | awk -v OFS='\t' '{$4 = $2 / $3}1' > test2.txt

Step 3, take the log base 10 of column 4
awk -v OFS='\t' '{print $1, $2, $3, log($4)/log(10)}' test2.txt > test3.txt

here
cat file | awk -v OFS='\t' '{print $1, $2/1000, $3}' | awk -v OFS='\t' '{$4 = $2 / $3}1' | awk -v OFS='\t' '{print $1, $2, $3, log($4)/log(10)}' > test3.txt
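
The three steps can also be combined into a single awk command (a sketch, assuming the same header-free three-field input as above):
cat file | awk -v OFS='\t' '{m = $2/1000/$3; print $1, $2/1000, $3, log(m)/log(10)}' > test3.txt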

Add a field before the first field (here "ID" on the first line and 0 on the others)

awk -v OFS='\t' '{ $1 = (NR==1?"ID":0) OFS $1 } 1'

Shift field order - reordering

awk -v OFS='\t' '{print $3, $2, $4, $1}'

Web scraping
Imagine 4 pages
https://mysite/myfolder/index?page=1
https://mysite/myfolder/index?page=2
https://mysite/myfolder/index?page=3
https://mysite/myfolder/index?page=4

page='curl https://mysite/myfolder/index?page=' ; for (( i = 1; i <= 4; i++ )); do echo $page${i} ; done
this prints one curl command per page; I can write the output in a file and run it to fetch the html pages

or run directly
curl `page='https://mysite/myfolder/index?page=' ; for (( i = 1; i <= 4; i++ )); do echo $page${i} ; done` >> myHTML.html
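
Alternatively (a sketch, assuming the same URL pattern), each page can be fetched into its own file with curl -o:
for (( i = 1; i <= 4; i++ )); do curl -o "page${i}.html" "https://mysite/myfolder/index?page=${i}" ; done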

Then if needed while read line loop
while read LINE; do echo "$LINE" | grep pattern ; done < myfile

--------some Python toolkits----------

Pocket-space maps to identify novel binding-site conformations in proteins
Epock: rapid analysis of protein pocket dynamics
Polyphony: superposition independent methods for ensemble-based drug discovery
Pytim: A python package for the interfacial analysis of molecular simulations
Biotite: a unifying open source computational biology framework in Python

MyMolDB: a micromolecular database solution with open source and free components
Scoria: a Python module for manipulating 3D molecular data

--

Clean gmail

size:15m older_than:5y

----------

Rescoring ideas

-Split with babel
babel output_top_docked_1500.mol2 file_name_output.mol2 -m

-Compress the original mol2 to avoid problem with the loop
gzip output_top_docked_1500.mol2

-Check number of mol2 in this directory
ls *.mol2 | wc -l

-Select only wanted files for the loop to rescore
for i in file_name*.mol2; do ./dligand2.gnu -P protein_with_H.pdb -L $i >> myresults.txt ; done


-Get the name of the file (many other ways to do that...here we keep it simple)
for i in file_name*.mol2; do echo $i >> compound_file_number_after_split_babel.txt; done

-Check with wc -l that the number of lines in compound_file_number_after_split_babel.txt matches the number of score lines in myresults.txt
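For example:
wc -l myresults.txt compound_file_number_after_split_babel.txt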

-Combine the two files
paste myresults.txt compound_file_number_after_split_babel.txt > ligand_rescored.txt

gives:
-9.97393 ...output1000.mol2
-9.79917 ...output1001.mol2
-8.33607 ...output1002.mol2

-SORT
If you have a recent GNU sort you can use sort -V: lexicographic order, except that sequences of digits are ordered according to their value as a decimal integer.
sort -V

Sort on second field, name of the files
sort -V -k2,2 ligand_rescored.txt > sorted1.txt

gives:
-7.32387 bruno_PD1_drug_MTI_vina_output1.mol2
-7.24097 bruno_PD1_drug_MTI_vina_output2.mol2
-6.82052 bruno_PD1_drug_MTI_vina_output3.mol2
-10.0055 bruno_PD1_drug_MTI_vina_output4.mol2


Or sort on the scores
sort -n -k1,1 ligand_rescored.txt > SORTED_on_score.txt
gives:
-14.0474 ...output3620.mol2
-14.1356 ...output3621.mol2
-13.0038 ...output2814.mol2
-13.0081 ...output4282.mol2
-13.0405 ...output2567.mol2

-Cat the files in the right order to generate big files sorted by new scores
filename_energysorted=`awk '{print $2}' SORTED_on_score.txt` ; echo "$filename_energysorted"

filename_energysorted=`awk '{print $2}' SORTED_on_score.txt` ; cat $filename_energysorted >> bigfile_rescored_energy_sorted.mol2

Some processing of MTiOpenScreen ideas (check that Babel does not miss some compounds...same number of molecules in all files..)

Output babel from docked Mol2 file, 3 best poses per compound, MTiOpenScreen
awk 'NR % 3 == 0' smiles_of_docked_1500_cmpds.txt > TEST_only_best_smiles_docked.txt

add header
awk 'BEGIN{printf("Smiles\n");} {print;}' TEST_only_best_smiles_docked.txt | awk '{print $1}' > smiles_info_ready_to_merge.txt

Paste two files to have them ready for DataWarrior, delimiter will be coma
paste -d"," smiles_info_ready_to_merge.txt output.table.csv > Output_table_smiles_docked_energy_ready_Warrior.csv

compare two files side by side
sdiff file1 file2

Idea to process mutation files
for file in mutation_D2.txt mutation_p3.txt mutation_l4.txt mutation_r5.txt mutation_E6.txt mutation_R7.txt mutation_A20.txt; do cat $file | awk '{print $5 , $7}' >> ALL.txt ; done

create files MUT1.txt, MUT2.txt, ... splitting at empty lines, so that each group of consecutive non-empty lines goes to its own file (the file name expression is parenthesised for portability)
awk 'BEGIN{i++} !NF{++i;next} {print > ("MUT" i ".txt")}' ALL.txt


Repeat 10 times
yes Hello | head -n 10

Compare files - find lines in one file not present in the second file (warning: watch for surrounding white space and extra lines, and check whether sorting is needed)
grep -v -f file1.txt file2.txt
Possibly faster, to check if sorting is needed before
grep -F -x -v -f file2 file1

grep -v -f F1.txt F2.txt
prints the lines of F2 that are not in F1

Also run the opposite order to files to check


This may work also (grep -q is used so the test does not break when a match contains several words)
for i in $(cat file2); do if grep -qi "$i" file1; then echo "$i found" >> Matching_lines.txt; else echo "$i missing" >> missing_lines.txt; fi; done

Find lines only in file1
comm -23 file1 file2

Find lines only in file2
comm -13 file1 file2

Find lines common to both files
comm -12 file1 file2
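
Warning: comm expects both files to be sorted; if they are not, in bash they can be sorted on the fly with process substitution, e.g. for the common lines:
comm -12 <(sort file1) <(sort file2)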

Find strings in a file and grep it in another file
grep -f mynames.txt grepIT-inthisfile.txt > lines_found.txt

Lowercase text in file or toupper()
awk '{print tolower($0)}'

Compare antiviral_cmpds.txt and in_my_base.txt
Replace the spaces inside names by dashes and put everything in lower case
cat antiviral_cmpds.txt | tr ' ' '-' | tr '[:upper:]' '[:lower:]' > antiviral_cmpds_clean.txt
cat in_my_base.txt | tr ' ' '-' | tr '[:upper:]' '[:lower:]' > in_my_base_clean.txt

Then to compare the unsorted files (grep -q makes the test robust; -i to ignore case is not really needed since everything is already lower case, but it is kept here)
for i in $(cat antiviral_cmpds_clean.txt); do if grep -qi "$i" in_my_base_clean.txt; then echo "$i found in both files" >> Matching_compound_lines_in_both_files.txt; else echo "$i missing in my_base_clean" >> Missing_compounds_in_my_base_clean.txt; fi; done

for i in $(cat in_my_base_clean.txt); do if grep -qi "$i" antiviral_cmpds_clean.txt; then echo "$i found in both files" >> Matching_compound_lines_in_both_files_again.txt; else echo "$i missing in antiviral_cmpds_clean" >> Missing_compounds_in_antiviral_cmpds_clean.txt; fi; done

Lowercase all but the first column
awk '{out=""; for (i=2; i<=NF; i++) out=out" "tolower($i); print $1out}' file.txt

Remove lines that are less than 20 characters
awk 'length>=20' file.txt

Remove lines that are more than 20 characters
awk 'length<=20' file.txt

Add text in front field 3 (tabulated file)
awk 'BEGIN { OFS = "\t" } { $3 = "zz-" $3; print }' file.txt

Count empty (blank or whitespace-only) lines in files
grep -cx '[[:space:]]*' *.txt

Print every line in which the 3rd tab-delimited column is not blank (first form); the second form does the same on whitespace-delimited input
awk -F'\t' '$3 != ""' myfile
awk '$3 != ""' myfile

Append text at the end of each line
sed 's/$/-MyText/' myfile
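
To append the text to every word instead (a sketch; a word here being any run of non-space characters):
sed 's/[^ ][^ ]*/&-MyText/g' myfile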

Format numbers with awk (here the value is printed with at least 2 characters and rounded to 3 digits after the decimal point)
awk '{printf "%2.3f\n",$0}' myinputfile

Delete field 3
awk '$3="";1' myfile
or
awk '{$3="";print}' myfile

Trim leading and trailing whitespace
awk '{$1=$1};1'

For loop to create one file for each distinct value of the first field of the input file
input file
A,1
A,2
B,3
C,4
C,5


#!/bin/bash
FILE=/path/to/file
values=`cat $FILE | awk -F, '{print $1}' | sort | uniq | tr '\n' ' '`
for i in $values; do
echo "value of i is $i" >> file_$i.sh
done
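
A one-pass awk alternative (a sketch; each line is written to a file named after its first field, e.g. A.txt, B.txt, C.txt here):
awk -F, '{print > ($1 ".txt")}' /path/to/file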

For loop for Babel
In a directory, two files:
a.smi
b.smi

Need to convert to sdf with babel
for f in *.smi; do b=`basename $f .smi` ; echo "Processing ligand $b" ; babel -ismi $f -osdf ${b}_babel-convert.sdf; done

The output will be:
b_babel-convert.sdf
a_babel-convert.sdf