9.3. Using the read command for more complex csv files

So far, we have learned that, using the for loop and the cat utility, you can read each line of a file and split it into fields using a separator. However, csv files become difficult to split into fields when some of the fields contain a comma (the same character that is used as the separator), a space, or both.

Example: Obtain the last field of $line using the concepts learned before.
$ line="SUBJ20"," Age 22-30","VISIT1","1","DIAGN: Major Depressive Disorder, Single Episode, In Full Remission"
$ IFS=',' read -a ARRAY <<< "$line"
$ echo "The last field of line is: ${ARRAY[4]}"
The last field of line is: DIAGN: Major Depressive Disorder

However, this is not the correct result. The last field of $line is "DIAGN: Major Depressive Disorder, Single Episode, In Full Remission", but because the comma is the separator, bash splits this field at every comma it contains. To work around this problem, you can read from a file descriptor with the read utility and save each field in a separate variable, so that the last variable receives the rest of the line.
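
To confirm the mis-split, the following sketch counts the elements that bash produces for the same line. It assumes the line is stored exactly as it would appear in a file (quotation marks included), which is why it is wrapped in single quotes here; the variable name n_fields is only illustrative.

```shell
#!/usr/bin/env bash
# Same record as in the example, kept as one literal string.
line='"SUBJ20"," Age 22-30","VISIT1","1","DIAGN: Major Depressive Disorder, Single Episode, In Full Remission"'

# Split on every comma, including the ones inside the last quoted field.
# (-r keeps backslashes literal; -a stores the pieces in an array.)
IFS=',' read -r -a ARRAY <<< "$line"

# The record has 5 logical fields, but the array holds more elements.
n_fields=${#ARRAY[@]}
echo "Number of elements: $n_fields"
echo "Last element: ${ARRAY[$((n_fields - 1))]}"
```

The count is 7 rather than 5, because the two commas inside the last column are treated as separators too.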

The first step is to assign a file descriptor (which must be an integer number) to the input file:
$ exec 3< "$INPUT_FILE"

Then, to read each line of the file and save each field in a different variable:
$ read -u 3 a b

The previous command reads the next line of the input file with descriptor 3 (the number following the -u flag) and saves the first field in variable a and the rest of the line in variable b. By default, fields are separated by whitespace (spaces or tabs).
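
As a minimal sketch of these two steps together, the script below opens a file on descriptor 3 and calls read twice. The temporary file and its two lines are made up for illustration, and the names a2 and b2 are only there to keep both results visible.

```shell
#!/usr/bin/env bash
# Hypothetical input file with two whitespace-separated lines.
INPUT_FILE=$(mktemp)
printf '%s\n' 'SUBJ1 VISIT1 baseline scan' 'SUBJ2 VISIT2 follow-up scan' > "$INPUT_FILE"

# Open the file for reading on descriptor 3.
exec 3< "$INPUT_FILE"

# First call: a gets the first word, b gets the rest of line 1.
read -r -u 3 a b
echo "a=$a | b=$b"

# Second call on the same descriptor continues with line 2.
read -r -u 3 a2 b2
echo "a2=$a2 | b2=$b2"

# Close the descriptor and remove the temporary file.
exec 3<&-
rm -f "$INPUT_FILE"
```

Because the descriptor stays open between calls, each read picks up where the previous one stopped.
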
$ read -u 3 a b c

The previous command reads the next line of the input file with descriptor 3 and saves the first field in variable a, the second field in variable b, and the rest of the line in variable c (variable c is empty if there are no more fields to read). If you want to split the fields using a comma as the separator, you should use the following command instead:
$ IFS=',' read -u 3 a b c d

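The read utility does not treat quotation marks specially: it splits the line at every comma. However, when a line contains more fields than there are variables, the last variable receives the rest of the line, commas included. In the example before, which has five columns, reading with five variables (for example, a b c d e) therefore stores the following text, quotation marks included, in the last variable: "DIAGN: Major Depressive Disorder, Single Episode, In Full Remission". This only works when the column that contains commas is the last one on the line.
Each time you run the command IFS=',' read -u 3 a b c d, it reads the next line of the file.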
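
The behavior of the last variable can be checked with a short sketch. The two-line file below is hypothetical, and the names first_d, second_a, and second_d exist only to keep the values from both calls.

```shell
#!/usr/bin/env bash
# Hypothetical two-line CSV; only the last field contains embedded commas.
csv=$(mktemp)
printf '%s\n' \
  '"SUBJ20","VISIT1","1","DIAGN: Major Depressive Disorder, Single Episode"' \
  '"SUBJ21","VISIT2","0","DIAGN: Autism, Dyslexia, ADHD"' > "$csv"

exec 3< "$csv"

# Four variables for four fields: d receives everything after the third comma.
IFS=',' read -r -u 3 a b c d
first_d=$d

# Running the same command again reads the next line.
IFS=',' read -r -u 3 a b c d
second_a=$a
second_d=$d

# Close the descriptor and remove the temporary file.
exec 3<&-
rm -f "$csv"

echo "$first_d"
echo "$second_a"
echo "$second_d"
```

Note that the quotation marks stay in the values, and that a comma inside any column other than the last one would still shift the fields.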

Read each line of a file and save the first and last fields into a new file

Given the file example.csv with the following content:
"SUBJ1","Age 22-30","VISIT1","DIAGN: Major Depressive Disorder, Single Episode"
"SUBJ2","Age 22-30","VISIT1","DIAGN: Bipolar, Schizophrenia"
"SUBJ3","Age 22-30","VISIT1","DIAGN: Major Depressive Disorder"
"SUBJ4","Age 22-30","VISIT1","DIAGN: Autism, Dyslexia, ADHD"
Read each line of the file and save the first and last fields into a new file called result.csv.

Assign file descriptor 3 to example.csv for input:
$ exec 3< example.csv
Obtain the number of lines in the input file:
$ N=$(wc -l < example.csv)
$ echo $N
Iterate through all the lines of the file saving each field in a different variable. Then, write the value of the first and last fields into the output file.
$ i=0
$ while [ $((i++)) -lt $N ]
> do
> IFS=',' read -u 3 f1 f2 f3 f4
> echo "$f1,$f4" >> result.csv
> done
When you are done, close the file descriptor using the following command (replace the number 3 with the corresponding file descriptor):
$ exec 3<&-
Read the content of the output file.
$ cat result.csv
"SUBJ1","DIAGN: Major Depressive Disorder, Single Episode"
"SUBJ2","DIAGN: Bipolar, Schizophrenia"
"SUBJ3","DIAGN: Major Depressive Disorder"
"SUBJ4","DIAGN: Autism, Dyslexia, ADHD"
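
As an alternative sketch of the same task, the loop below lets read itself signal the end of the file, so neither the line count nor an explicit file descriptor is needed. It recreates example.csv in a temporary directory (the workdir variable is only for illustration) so that it can be run on its own.

```shell
#!/usr/bin/env bash
# Recreate example.csv in a temporary directory so the sketch is self-contained.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
cat > example.csv <<'EOF'
"SUBJ1","Age 22-30","VISIT1","DIAGN: Major Depressive Disorder, Single Episode"
"SUBJ2","Age 22-30","VISIT1","DIAGN: Bipolar, Schizophrenia"
"SUBJ3","Age 22-30","VISIT1","DIAGN: Major Depressive Disorder"
"SUBJ4","Age 22-30","VISIT1","DIAGN: Autism, Dyslexia, ADHD"
EOF

# read returns a non-zero status at end of file, which ends the loop.
# f2 and f3 are read but not used.
while IFS=',' read -r f1 f2 f3 f4
do
    echo "$f1,$f4"
done < example.csv > result.csv

cat result.csv
```

Redirecting the whole loop with > result.csv also overwrites the output file on each run, whereas appending with >> inside the loop keeps adding lines if the commands are run more than once.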