Print the first column of file1.csv and file2.txt in alphabetical order. To do this, sort the result after printing the first column: first use awk to print the desired column, then pipe the output through | sort to order it alphabetically.
Space-separated file:

$ awk '{print $1}' file2.txt | sort
"AnonymizedID"
"B11110676"
"B11130912"
"B11131605"
"B11133232"
"B11133232"
"B11134987"
"B11135292"
"B11137120"
"B11137244"
"B11137244"
"B11137784"
"B11144100"
"B11144345"
"B11150911"
"B11152577"
"B11154532"
"B11154534"
"B11155267"
"B11155267"
"B11156098"
"B11156098"
"B11156453"
"B12226566"
"B33191224"
"C11137159"
"C11138912"
"C11138929"
"C11138999"

Comma-separated file:

$ awk -F ',' '{print $1}' file1.csv | sort
"Anonymized ID"
"B11110455"
"B11110603"
"B11110690"
"B11110925"
"B11110927"
"B11131290"
"B11135072"
"B11135291"
"B11135291"
"B11137879"
"B11137879"
"B11141503"
"B11144410"
"B11147712"
"B11152799"
"B11153927"
"B11154358"
"B11157958"
"B11157974"
"B11177579"
"B11177806"
"B33191224"
"B33191224"
"B33199522"
"B33199522"
"B33199603"
"B33199603"
"C11137159"
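The only difference between the two commands is the field separator. A minimal sketch of how -F changes awk's splitting, using made-up one-line inputs rather than the files above:

```shell
# awk splits on whitespace by default, so $1 of a space-separated line
# is the first word:
echo 'a b c' | awk '{print $1}'          # prints: a

# without -F, a comma-separated line contains no whitespace,
# so the whole line is a single field:
echo 'a,b,c' | awk '{print $1}'          # prints: a,b,c

# -F ',' tells awk to split on commas instead:
echo 'a,b,c' | awk -F ',' '{print $1}'   # prints: a
```

This is why file2.txt needs no -F option while file1.csv needs -F ','.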
Print the first column of file1.csv and file2.txt in alphabetical order, removing any duplicate values. To do this, remove the duplicates after printing and sorting the first column: first use awk to print the desired column, then pipe the result through sort | uniq. Note that uniq only removes adjacent duplicate lines, which is why the output must be sorted before uniq runs.
Space-separated file:

$ awk '{print $1}' file2.txt | sort | uniq
"AnonymizedID"
"B11110676"
"B11130912"
"B11131605"
"B11133232"
"B11134987"
"B11135292"
"B11137120"
"B11137244"
"B11137784"
"B11144100"
"B11144345"
"B11150911"
"B11152577"
"B11154532"
"B11154534"
"B11155267"
"B11156098"
"B11156453"
"B12226566"
"B33191224"
"C11137159"
"C11138912"
"C11138929"
"C11138999"

Comma-separated file:

$ awk -F ',' '{print $1}' file1.csv | sort | uniq
"Anonymized ID"
"B11110455"
"B11110603"
"B11110690"
"B11110925"
"B11110927"
"B11131290"
"B11135072"
"B11135291"
"B11137879"
"B11141503"
"B11144410"
"B11147712"
"B11152799"
"B11153927"
"B11154358"
"B11157958"
"B11157974"
"B11177579"
"B11177806"
"B33191224"
"B33199522"
"B33199603"
"C11137159"
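As an aside, sort can also deduplicate on its own: sort -u is equivalent to sort | uniq for this use. A small sketch with made-up data:

```shell
# uniq only drops *adjacent* duplicate lines, which is why sort must
# run first; sort -u performs both steps in one command:
printf 'b\na\nb\n' | sort | uniq    # prints: a, then b
printf 'b\na\nb\n' | sort -u        # same result
```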
Print the first column of file1.csv and file3.csv combined, in alphabetical order and with no duplicates.
As explained in a previous example, to print the first column of file1.csv and file3.csv combined, just give awk the list of files to read (file1.csv file3.csv) at the end of the command. Then use | sort to put the output in alphabetical order, and finally | uniq to remove the duplicates.
In this case, because the values in file1.csv all start with a double quote ("), while the values in file3.csv don't, all the values of file1.csv are printed before those of file3.csv: in sort's collating order, a special character such as " comes before any letter, including A. So, for sort, "B11110455" goes before Anonymized ID. For the same reason, uniq does not merge "C11137159" and C11137159, which appear in both files: with the quotes they are different strings, so both remain in the output.
$ awk -F ',' '{print $1}' file1.csv file3.csv | sort | uniq
"Anonymized ID"
"B11110455"
"B11110603"
"B11110690"
"B11110925"
"B11110927"
"B11131290"
"B11135072"
"B11135291"
"B11137879"
"B11141503"
"B11144410"
"B11147712"
"B11152799"
"B11153927"
"B11154358"
"B11157958"
"B11157974"
"B11177579"
"B11177806"
"B33191224"
"B33199522"
"B33199603"
"C11137159"
Anonymized ID
B11108326
B11108399
B11110893
B11119903
B11119909
B12226507
B12226546
C11131039
C11133100
C11135566
C11137123
C11137159
C11137167
C11137439
C11137443
C11137544
C11138122
C11138150
C11138152
C11138184
C11138192
C11138797
D11144030
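If you instead wanted the IDs from both files to interleave in a single alphabetical sequence, one option (a sketch, not part of the exercise) is to strip the double quotes inside awk with gsub() before printing. Illustrated here with two made-up input lines rather than the real files:

```shell
# gsub(/"/, "") deletes every double quote from the line before $1 is
# printed, so quoted and unquoted values sort together:
printf '"B11110455"\nAnonymized ID\n' | awk -F ',' '{gsub(/"/, ""); print $1}' | sort
# prints:
#   Anonymized ID
#   B11110455
```

Applied to the files above, the full pipeline would be awk -F ',' '{gsub(/"/, ""); print $1}' file1.csv file3.csv | sort | uniq.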