9.2. Reading files

Read line by line using the for loop and the cat command

The cat command, followed by the path of a file, displays the content of that file in the command line:

$ cat /Volumes/MyExternalDrive/MRIdata/subjectList.txt
AA0083277
AA0084999
AC0208933
AC0148099
AD0190300
BB0299033
BC0345100
BD0365666
CA0372599
CA0381677
CB0384399
CC0384433
DD0385444
...

Moreover, if you want to read a file line by line and run a set of instructions on each line, you can combine the cat and for commands.

Read the previous file line by line

The following example reads the previous file (subjectList.txt), which contains a list of subject IDs, line by line. It copies into a new file (subjectInfo.txt) each subject ID plus the group it belongs to (which can be obtained from the first two letters of the subject ID).

$ FOLDER=/Volumes/MyExternalDrive/MRIdata
$ for line in $(cat $FOLDER/subjectList.txt)
> do
> echo "${line:0:2},${line}" >> $FOLDER/subjectInfo.txt
> done

In this example, I am reading one line of subjectList.txt on every loop and assigning that line of text to the variable line (the variable could have had any other name). Then, I am extracting the first two characters in $line (which represent the subject group) and saving that information, together with the subject ID, into a new file (subjectInfo.txt). I am using ${line:0:2} to extract the first two characters of $line. If I now print the content of the new file, this is what it will contain:
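The ${line:0:2} syntax used above is the general substring form ${variable:offset:length}. The following is a minimal sketch of a few variations, using a hand-typed subject ID; the variable names group, number and last3 are only illustrative and do not appear in the example above.

```shell
# Subject ID typed by hand, only for illustration.
line="AC0208933"

group=${line:0:2}    # 2 characters starting at position 0
number=${line:2}     # everything from position 2 to the end
last3=${line: -3}    # last 3 characters (the space before -3 is required)

echo "$group $number $last3"
```

If the length is omitted, bash takes everything from the offset to the end of the string.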

$ cat /Volumes/MyExternalDrive/MRIdata/subjectInfo.txt
AA,AA0083277
AA,AA0084999
AC,AC0208933
AC,AC0148099
AD,AD0190300
BB,BB0299033
BC,BC0345100
BD,BD0365666
CA,CA0372599
CA,CA0381677
CB,CB0384399
CC,CC0384433
DD,DD0385444

Do statistics on the numerical values of a column from a text file

infoFile.csv is a comma-separated file that contains information about a list of subjects. Its content is shown below as a table, with one column per field:

SubjectID	Group	Gender	Ethnicity	Handedness	Age	Movement
AA0083277	Control	M	Hispanic	R	20	0.23525
AA0084999	Patient	M	Hispanic	R	18	0.14564
AC0208933	Control	F	Hispanic	R	17	0.18698
AC0148099	Control	M	NonHispanic	R	21	0.19789
AD0190300	Patient	M	NonHispanic	R	16	0.23454
BB0299033	Control	F	NonHispanic	R	22	0.19752
BC0345100	Control	M	NonHispanic	R	19	0.18789
BD0365666	Patient	F	NonHispanic	R	17	0.14386
CA0372599	Patient	F	NonHispanic	R	20	0.12384
CA0381677	Control	F	NonHispanic	L	17	0.13453
CB0384399	Control	F	Hispanic	R	18	0.45655
CC0384433	Control	M	NonHispanic	R	15	0.13465
DD0385444	Patient	M	Hispanic	R	16	0.32433

In this example we will calculate the minimum, maximum and average movement in the MRI scanner for the subjects of each group and gender. These values should be shown with only three decimals. There are many ways to do that, some of them a lot more efficient than the one presented here, but they use functions that we have not learned yet. In this case we will use the cat command to read from the file, the for loop, and some of the non-integer and array operations learned in previous chapters.

The for loop will read one line of the csv file in each iteration and extract the group, gender and movement values from that line. Depending on the group and gender, it will add the movement to one of the following arrays:

CM: to save the movement of all male controls.
CF: to save the movement of all female controls.
PM: to save the movement of all male patients.
PF: to save the movement of all female patients.

In bash it is not necessary to initialize an array. Instead, you can start adding values and the first time you add a value to a non-existent array, it will be automatically initialized. When you ask bash the size of an array that hasn’t been initialized, it will return the value zero.
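The behaviour described above can be checked with a short sketch. DEMO is a hypothetical array name chosen for this illustration, and the += form in the last assignment is an equivalent shorthand that is not used elsewhere in this section.

```shell
unset DEMO                     # make sure the array DEMO does not exist yet
echo "${#DEMO[@]}"             # size of a non-existent array

DEMO[${#DEMO[@]}]=0.23525      # size is 0, so the value goes to position 0
DEMO[${#DEMO[@]}]=0.19789      # size is 1, so the value goes to position 1
DEMO+=(0.18789)                # shorthand: append one value at the end

echo "${#DEMO[@]}"             # size after adding three values
echo "${DEMO[2]}"              # last value added
```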

These are the steps to follow in order to calculate the minimum, maximum and average movement from the file:

1. Create a loop that reads each line of the file (except the first one, which is just a header with the column names).
2. In each iteration do the following:
   2.1. Split the line using the comma as a separator and save the result in a variable called ARRAY.
   2.2. Obtain the subject group, which is located in the 2nd column (position 1 of the array). Remember, bash arrays start at position 0 (not position 1).
   2.3. Obtain the subject gender, which is located in the 3rd column (position 2 of the array).
   2.4. Obtain the subject movement, which is located in the 7th column (position 6 of the array).
   • If group equals "Control" and gender equals "M" (Male): add the movement at the end of the array CM. If CM has zero values, the new item should be added to position 0; if CM has one value, the new item should be added to position 1 (because the existing item in the array is in position 0), and so on. So, every new item is added to the position that is equal to the current size of the array. As a reminder, the size of an array can be obtained with ${#array[@]}.
   • If group equals "Control" and gender equals "F" (Female): add the movement at the end of the array CF.
   • If group equals "Patient" and gender equals "M": add the movement at the end of the array PM.
   • If group equals "Patient" and gender equals "F": add the movement at the end of the array PF.
3. Sort the four arrays with the previously learned command: IFS=$'\n' sorted=($(sort <<<"${array[*]}"))
4. Show the minimum, maximum and average value of each array. Use printf instead of echo in order to show only three decimals per number:
   • Minimum value: the first value in the sorted array.
   • Maximum value: the last value in the sorted array (in the position SIZE_ARRAY – 1).
   • Average value: the sum of all values divided by the size of the array. As a reminder, this is the general command used to calculate the average of an array, as shown in previous chapters: IFS='+' avg=$(echo "scale=1;(${array[*]})/${#array[@]}"|bc). In this example scale is set to 4, so that the result can be rounded to three decimals.
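Steps 3 and 4 can be tried on their own before reading the whole file. The following is a minimal sketch on four hand-typed values; the names values, sum_expr and OLD_IFS are hypothetical. It makes two additions that are not part of the steps above: it uses sort -n to force a numerical sort, and it saves and restores IFS so that later commands are not affected by the changed separator. It assumes the bc utility is installed.

```shell
# Movement values typed by hand, only for illustration.
values=(0.23525 0.19789 0.18789 0.13465)

OLD_IFS=$IFS                                      # remember the current separator
IFS=$'\n' sorted=($(sort -n <<<"${values[*]}"))   # one value per line, sorted numerically
IFS='+' sum_expr="${values[*]}"                   # values joined with +, ready for bc
IFS=$OLD_IFS                                      # restore the separator

min=${sorted[0]}                                  # first value of the sorted array
max=${sorted[${#sorted[@]}-1]}                    # last value of the sorted array
avg=$(echo "scale=4;($sum_expr)/${#values[@]}" | bc)

printf "Min: %.3f\nMax: %.3f\nAve: %.3f\n" "$min" "$max" "$avg"
```

A plain sort compares the values as text, which gives the right order here only because all the values have the same number of digits before the decimal point.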

$ n=0
$ for line in $(cat infoFile.csv)
> do
> if [ $((n++)) -gt 0 ]
> then
> IFS=',' read -a ARRAY <<< "${line}"
> GRP=${ARRAY[1]}
> GEN=${ARRAY[2]}
> MOV=${ARRAY[6]}
> if [ "$GRP" == "Control" ] && [ "$GEN" == "M" ]
> then
> CM[${#CM[@]}]=${MOV}
> fi
> if [ "$GRP" == "Control" ] && [ "$GEN" == "F" ]
> then
> CF[${#CF[@]}]=${MOV}
> fi
> if [ "$GRP" == "Patient" ] && [ "$GEN" == "M" ]
> then
> PM[${#PM[@]}]=${MOV}
> fi
> if [ "$GRP" == "Patient" ] && [ "$GEN" == "F" ]
> then
> PF[${#PF[@]}]=${MOV}
> fi
> fi
> done

$ IFS=$'\n' sortedCM=($(sort <<<"${CM[*]}"))
$ IFS='+' avg=$(echo "scale=4;(${CM[*]})/${#CM[@]}"|bc)
$ printf "Male Controls:\nMin: %.3f\nMax: %.3f\nAve: %.3f\n" ${sortedCM[0]} ${sortedCM[${#sortedCM[@]} -1]} $avg

$ IFS=$'\n' sortedCF=($(sort <<<"${CF[*]}"))
$ IFS='+' avg=$(echo "scale=4;(${CF[*]})/${#CF[@]}"|bc)
$ printf "Female Controls:\nMin: %.3f\nMax: %.3f\nAve: %.3f\n" ${sortedCF[0]} ${sortedCF[${#sortedCF[@]} -1]} $avg

$ IFS=$'\n' sortedPM=($(sort <<<"${PM[*]}"))
$ IFS='+' avg=$(echo "scale=4;(${PM[*]})/${#PM[@]}"|bc)
$ printf "Male Patients:\nMin: %.3f\nMax: %.3f\nAve: %.3f\n" ${sortedPM[0]} ${sortedPM[${#sortedPM[@]} -1]} $avg

$ IFS=$'\n' sortedPF=($(sort <<<"${PF[*]}"))
$ IFS='+' avg=$(echo "scale=4;(${PF[*]})/${#PF[@]}"|bc)
$ printf "Female Patients:\nMin: %.3f\nMax: %.3f\nAve: %.3f\n" ${sortedPF[0]} ${sortedPF[${#sortedPF[@]} -1]} $avg

The number of lines of the previous code could be reduced by simplifying the if expressions. The code below is equivalent to the code above:

$ n=0
$ for line in $(cat infoFile.csv)
> do
> if [ $((n++)) -gt 0 ]
> then
> IFS=',' read -a ARRAY <<< "${line}"
> GRP=${ARRAY[1]}
> GEN=${ARRAY[2]}
> MOV=${ARRAY[6]}
> [ "$GRP" == "Control" ] && [ "$GEN" == "M" ] && CM[${#CM[@]}]=${MOV}
> [ "$GRP" == "Control" ] && [ "$GEN" == "F" ] && CF[${#CF[@]}]=${MOV}
> [ "$GRP" == "Patient" ] && [ "$GEN" == "M" ] && PM[${#PM[@]}]=${MOV}
> [ "$GRP" == "Patient" ] && [ "$GEN" == "F" ] && PF[${#PF[@]}]=${MOV}
> fi
> done

$ IFS=$'\n' sortedPF=($(sort <<<"${PF[*]}"))
$ IFS='+' avg=$(echo "scale=4;(${PF[*]})/${#PF[@]}"|bc)
$ printf "Female Patients:\nMin: %.3f\nMax: %.3f\nAve: %.3f\n" ${sortedPF[0]} ${sortedPF[${#sortedPF[@]} -1]} $avg
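The shorter form works because && runs the command on its right only when the command on its left succeeds. As a minimal sketch with hand-typed values (the values assigned to GRP, GEN and the array are illustrative only), the long and the short form can be compared side by side:

```shell
GRP="Control"   # values typed by hand, only for illustration
GEN="F"
unset CF        # start with a non-existent array

# Long form: explicit if ... then ... fi
if [ "$GRP" == "Control" ] && [ "$GEN" == "F" ]
then
    CF[${#CF[@]}]=0.18698
fi

# Short form: the assignment runs only if both tests succeed
[ "$GRP" == "Control" ] && [ "$GEN" == "F" ] && CF[${#CF[@]}]=0.19752

# Here the first test fails, so nothing is added to the array
[ "$GRP" == "Patient" ] && [ "$GEN" == "F" ] && CF[${#CF[@]}]=0.14386

echo "${#CF[@]}"   # number of values stored in CF
echo "${CF[@]}"    # the stored values
```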

You could reduce the number of lines in the code even further:

$ n=0
$ for line in $(cat infoFile.csv)
> do
> if [ $((n++)) -gt 0 ]
> then
> IFS=',' read -a ARRAY <<< "${line}"
> [ "${ARRAY[1]}" == "Control" ] && [ "${ARRAY[2]}" == "M" ] && CM[${#CM[@]}]=${ARRAY[6]}
> [ "${ARRAY[1]}" == "Control" ] && [ "${ARRAY[2]}" == "F" ] && CF[${#CF[@]}]=${ARRAY[6]}
> [ "${ARRAY[1]}" == "Patient" ] && [ "${ARRAY[2]}" == "M" ] && PM[${#PM[@]}]=${ARRAY[6]}
> [ "${ARRAY[1]}" == "Patient" ] && [ "${ARRAY[2]}" == "F" ] && PF[${#PF[@]}]=${ARRAY[6]}
> fi
> done

$ IFS=$'\n' sortedPF=($(sort <<<"${PF[*]}"))
$ IFS='+' avg=$(echo "scale=4;(${PF[*]})/${#PF[@]}"|bc)
$ printf "Female Patients:\nMin: %.3f\nMax: %.3f\nAve: %.3f\n" ${sortedPF[0]} ${sortedPF[${#sortedPF[@]} -1]} $avg

In the previous example we read a file line by line using a for loop and the cat utility. This works most of the time. However, if you read a file this way and one or more of its lines contain a space, bash will treat each word separated by a space as a separate line.

For example, if file test.txt has the following content:
a b
c d
e f
g h
i j

When you try to read each line using a for loop, this is the result you will get:

$ for line in $(cat test.txt)
> do
> echo $((i++)) $line
> done
0 a
1 b
2 c
3 d
4 e
5 f
6 g
7 h
8 i
9 j

To fix this problem you have to tell bash that the newline (\n) is the only separator. You do this by setting the shell variable IFS to $'\n'. This setting stays in effect for the rest of the session, until you change IFS again.

$ IFS=$'\n'
$ i=0
$ for line in $(cat test.txt)
> do
> echo $((i++)) $line
> done
0 a b
1 c d
2 e f
3 g h
4 i j
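An alternative that is not covered in this section is to read the file with a while loop and the read command, which takes one whole line at a time and does not require changing IFS for the rest of the session. The following is a minimal sketch under that assumption; it creates its own temporary copy of the test file (TMPFILE is a hypothetical name) so that it can be run as is.

```shell
# Create a small temporary file with the same kind of content as test.txt.
TMPFILE=$(mktemp)
printf 'a b\nc d\ne f\n' > "$TMPFILE"

i=0
# IFS= (empty, only for read) keeps leading and trailing spaces of each line;
# -r stops read from treating backslashes as escape characters.
while IFS= read -r line
do
    echo "$((i++)) $line"
done < "$TMPFILE"

rm -f "$TMPFILE"    # remove the temporary file
```

Because IFS is set only for the read command, the separator used by the rest of the shell stays unchanged.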

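Load the content of a file into an array and access a specific line separately

With IFS still set to newline (as in the previous example), each line of the file becomes one element of the array:

$ ARRAY=($(cat test.txt))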
$ echo ${ARRAY[0]}
a b
$ echo ${ARRAY[1]}
c d
$ echo ${ARRAY[2]}
e f
$ echo ${ARRAY[3]}
g h
$ echo ${ARRAY[4]}
i j
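In bash version 4 or later, the built-in mapfile command offers another way to load a file into an array, one line per element, without depending on the value of IFS. This is a sketch under that version assumption (the default bash on macOS is older and does not include mapfile); ROWS and TMPFILE are hypothetical names.

```shell
# Create a small temporary file with the same kind of content as test.txt.
TMPFILE=$(mktemp)
printf 'a b\nc d\ne f\n' > "$TMPFILE"

mapfile -t ROWS < "$TMPFILE"   # -t removes the newline at the end of each line

echo "${#ROWS[@]}"             # number of lines loaded
echo "${ROWS[1]}"              # second line of the file

rm -f "$TMPFILE"               # remove the temporary file
```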