Read line by line using the for loop and the cat command
The cat command, followed by the path of a file, prints the content of that file on the command line:
$ cat /Volumes/MyExternalDrive/MRIdata/subjectList.txt
AA0083277
AA0084999
AC0208933
AC0148099
AD0190300
BB0299033
BC0345100
BD0365666
CA0372599
CA0381677
CB0384399
CC0384433
DD0385444
...
Moreover, if you want to read line by line and run a set of instructions on each line, you can combine the cat and for commands.
Read the previous file line by line
Read the previous file (subjectList.txt), which contains a list of subject IDs, line by line, and copy into a new file (subjectInfo.txt) each subject ID together with the group it belongs to (given by the first two letters of the subject ID).
$ FOLDER=/Volumes/MyExternalDrive/MRIdata
$ for line in $(cat $FOLDER/subjectList.txt)
> do
> echo "${line:0:2},${line}" >> $FOLDER/subjectInfo.txt
> done
In this example, I am reading one line of subjectList.txt on every iteration of the loop and assigning that line of text to the variable line (the variable could have had any other name). Then, I am using ${line:0:2} to extract the first two characters of $line, which represent the subject group, and appending that information, together with the subject ID, to a new file (subjectInfo.txt). Because >> appends to the file, running the loop a second time would add the same lines again. If I now print the content of the new file, this is what it will contain:
$ cat /Volumes/MyExternalDrive/MRIdata/subjectInfo.txt
AA,AA0083277
AA,AA0084999
AC,AC0208933
AC,AC0148099
AD,AD0190300
BB,BB0299033
BC,BC0345100
BD,BD0365666
CA,CA0372599
CA,CA0381677
CB,CB0384399
CC,CC0384433
DD,DD0385444
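The ${line:0:2} expression used in the loop is bash substring expansion: ${variable:offset:length}. The short sketch below tries it on a single value so the offsets are easier to see. The variable names id, group and number are only illustrative, and the ID is one of those listed in subjectList.txt.

```shell
#!/usr/bin/env bash
# One subject ID in the same format as the entries of subjectList.txt.
id="AC0208933"

group="${id:0:2}"    # 2 characters starting at offset 0 (the group letters)
number="${id:2}"     # no length given: everything from offset 2 to the end

echo "${group},${id}"    # prints AC,AC0208933
echo "${number}"         # prints 0208933
```

Offsets start at 0, so ${id:0:2} covers the first and second characters.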
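Do statistics on the numerical values of a column from a text file
infoFile.csv is a file that contains some information about a list of subjects. Its content is shown below with the columns separated by | for readability; the code in this example assumes that in the file itself the fields are separated by commas, with no spaces: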
SubjectID | Group | Gender | Ethnicity | Handedness | Age | Movement |
AA0083277 | Control | M | Hispanic | R | 20 | 0.23525 |
AA0084999 | Patient | M | Hispanic | R | 18 | 0.14564 |
AC0208933 | Control | F | Hispanic | R | 17 | 0.18698 |
AC0148099 | Control | M | NonHispanic | R | 21 | 0.19789 |
AD0190300 | Patient | M | NonHispanic | R | 16 | 0.23454 |
BB0299033 | Control | F | NonHispanic | R | 22 | 0.19752 |
BC0345100 | Control | M | NonHispanic | R | 19 | 0.18789 |
BD0365666 | Patient | F | NonHispanic | R | 17 | 0.14386 |
CA0372599 | Patient | F | NonHispanic | R | 20 | 0.12384 |
CA0381677 | Control | F | NonHispanic | L | 17 | 0.13453 |
CB0384399 | Control | F | Hispanic | R | 18 | 0.45655 |
CC0384433 | Control | M | NonHispanic | R | 15 | 0.13465 |
DD0385444 | Patient | M | Hispanic | R | 16 | 0.32433 |
In this example we will calculate the minimum, maximum and average movement in the MRI scanner for the subjects in each group and gender. These values should be shown with only three decimals. There are many ways to do that, some of them a lot more efficient than the one presented here, using functions that we have not learned yet. In this case we will use the cat command to read from the file, the for loop, and some of the non-integer and array operations learned in previous chapters.
On each iteration, the for loop will read one line of the csv file and extract the gender, group and movement values from it. Depending on the group and gender, it will add the movement to one of the following arrays:
CM: to save the movement of all male controls.
CF: to save the movement of all female controls.
PM: to save the movement of all male patients.
PF: to save the movement of all female patients.
In bash it is not necessary to initialize an array. Instead, you can start adding values and the first time you add a value to a non-existent array, it will be automatically initialized. When you ask bash the size of an array that hasn’t been initialized, it will return the value zero.
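As a quick check of this behaviour, the sketch below asks for the size of an array that does not exist yet and then appends two values to it. The array name DEMO and the variable size_before are hypothetical; the two numbers are movement values taken from the table above.

```shell
#!/usr/bin/env bash
unset DEMO                    # make sure the array does not exist yet
size_before=${#DEMO[@]}       # size of a non-existent array: 0

DEMO[${#DEMO[@]}]=0.23525     # size is 0, so the value is stored at index 0
DEMO[${#DEMO[@]}]=0.19789     # size is now 1, so the value is stored at index 1

echo "size before: ${size_before}"    # prints size before: 0
echo "size after: ${#DEMO[@]}"        # prints size after: 2
echo "last item: ${DEMO[1]}"          # prints last item: 0.19789
```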
These are the steps to follow in order to calculate the minimum, maximum and average movement from the file:
• If group equals "Control" and gender equals "M" (Male):
Add the movement at the end of the array CM. If CM has zero values, the new item should be added at position 0; if CM has one value, the new item should be added at position 1 (because the existing item in the array will be at position 0), and so on. In other words, every new item is added at the position that is equal to the current size of the array. As a reminder, the size of an array can be obtained with ${#array[@]}.
• If group equals "Control" and gender equals "F" (Female):
Add the movement at the end of the array CF.
• If group equals "Patient" and gender equals "M":
Add the movement at the end of the array PM.
• If group equals "Patient" and gender equals "F":
Add the movement at the end of the array PF.
$ n=0
$ for line in $(cat infoFile.csv)
> do
> if [ $((n++)) -gt 0 ]
> then
> IFS=',' read -a ARRAY <<< "${line}"
> GRP=${ARRAY[1]}
> GEN=${ARRAY[2]}
> MOV=${ARRAY[6]}
> if [ "$GRP" == "Control" ] && [ "$GEN" == "M" ]
> then
> CM[${#CM[@]}]=${MOV}
> fi
> if [ "$GRP" == "Control" ] && [ "$GEN" == "F" ]
> then
> CF[${#CF[@]}]=${MOV}
> fi
> if [ "$GRP" == "Patient" ] && [ "$GEN" == "M" ]
> then
> PM[${#PM[@]}]=${MOV}
> fi
> if [ "$GRP" == "Patient" ] && [ "$GEN" == "F" ]
> then
> PF[${#PF[@]}]=${MOV}
> fi
> fi
> done
$ IFS=$'\n' sortedCM=($(sort -n <<<"${CM[*]}"))
$ IFS='+' avg=$(echo "scale=4;(${CM[*]})/${#CM[@]}"|bc)
$ printf "Male Controls:\nMin: %.3f\nMax: %.3f\nAve: %.3f\n" ${sortedCM[0]} ${sortedCM[${#sortedCM[@]} -1]} $avg
$ IFS=$'\n' sortedCF=($(sort -n <<<"${CF[*]}"))
$ IFS='+' avg=$(echo "scale=4;(${CF[*]})/${#CF[@]}"|bc)
$ printf "Female Controls:\nMin: %.3f\nMax: %.3f\nAve: %.3f\n" ${sortedCF[0]} ${sortedCF[${#sortedCF[@]} -1]} $avg
$ IFS=$'\n' sortedPM=($(sort -n <<<"${PM[*]}"))
$ IFS='+' avg=$(echo "scale=4;(${PM[*]})/${#PM[@]}"|bc)
$ printf "Male Patients:\nMin: %.3f\nMax: %.3f\nAve: %.3f\n" ${sortedPM[0]} ${sortedPM[${#sortedPM[@]} -1]} $avg
$ IFS=$'\n' sortedPF=($(sort -n <<<"${PF[*]}"))
$ IFS='+' avg=$(echo "scale=4;(${PF[*]})/${#PF[@]}"|bc)
$ printf "Female Patients:\nMin: %.3f\nMax: %.3f\nAve: %.3f\n" ${sortedPF[0]} ${sortedPF[${#sortedPF[@]} -1]} $avg
$ unset IFS   # go back to the default separators (space, tab, newline)
The number of lines of the previous code could be reduced by simplifying the if expressions. The code below is equivalent to the code above:
$ n=0
$ for line in $(cat infoFile.csv)
> do
> if [ $((n++)) -gt 0 ]
> then
> IFS=',' read -a ARRAY <<< "${line}"
> GRP=${ARRAY[1]}
> GEN=${ARRAY[2]}
> MOV=${ARRAY[6]}
> [ "$GRP" == "Control" ] && [ "$GEN" == "M" ] && CM[${#CM[@]}]=${MOV}
> [ "$GRP" == "Control" ] && [ "$GEN" == "F" ] && CF[${#CF[@]}]=${MOV}
> [ "$GRP" == "Patient" ] && [ "$GEN" == "M" ] && PM[${#PM[@]}]=${MOV}
> [ "$GRP" == "Patient" ] && [ "$GEN" == "F" ] && PF[${#PF[@]}]=${MOV}
> fi
> done
$ IFS=$'\n' sortedCM=($(sort -n <<<"${CM[*]}"))
$ IFS='+' avg=$(echo "scale=4;(${CM[*]})/${#CM[@]}"|bc)
$ printf "Male Controls:\nMin: %.3f\nMax: %.3f\nAve: %.3f\n" ${sortedCM[0]} ${sortedCM[${#sortedCM[@]} -1]} $avg
$ IFS=$'\n' sortedCF=($(sort -n <<<"${CF[*]}"))
$ IFS='+' avg=$(echo "scale=4;(${CF[*]})/${#CF[@]}"|bc)
$ printf "Female Controls:\nMin: %.3f\nMax: %.3f\nAve: %.3f\n" ${sortedCF[0]} ${sortedCF[${#sortedCF[@]} -1]} $avg
$ IFS=$'\n' sortedPM=($(sort -n <<<"${PM[*]}"))
$ IFS='+' avg=$(echo "scale=4;(${PM[*]})/${#PM[@]}"|bc)
$ printf "Male Patients:\nMin: %.3f\nMax: %.3f\nAve: %.3f\n" ${sortedPM[0]} ${sortedPM[${#sortedPM[@]} -1]} $avg
$ IFS=$'\n' sortedPF=($(sort -n <<<"${PF[*]}"))
$ IFS='+' avg=$(echo "scale=4;(${PF[*]})/${#PF[@]}"|bc)
$ printf "Female Patients:\nMin: %.3f\nMax: %.3f\nAve: %.3f\n" ${sortedPF[0]} ${sortedPF[${#sortedPF[@]} -1]} $avg
$ unset IFS   # go back to the default separators (space, tab, newline)
You could reduce the number of lines in the code even further:
$ n=0
$ for line in $(cat infoFile.csv)
> do
> if [ $((n++)) -gt 0 ]
> then
> IFS=',' read -a ARRAY <<< "${line}"
> [ "${ARRAY[1]}" == "Control" ] && [ "${ARRAY[2]}" == "M" ] && CM[${#CM[@]}]=${ARRAY[6]}
> [ "${ARRAY[1]}" == "Control" ] && [ "${ARRAY[2]}" == "F" ] && CF[${#CF[@]}]=${ARRAY[6]}
> [ "${ARRAY[1]}" == "Patient" ] && [ "${ARRAY[2]}" == "M" ] && PM[${#PM[@]}]=${ARRAY[6]}
> [ "${ARRAY[1]}" == "Patient" ] && [ "${ARRAY[2]}" == "F" ] && PF[${#PF[@]}]=${ARRAY[6]}
> fi
> done
$ IFS=$'\n' sortedCM=($(sort -n <<<"${CM[*]}"))
$ IFS='+' avg=$(echo "scale=4;(${CM[*]})/${#CM[@]}"|bc)
$ printf "Male Controls:\nMin: %.3f\nMax: %.3f\nAve: %.3f\n" ${sortedCM[0]} ${sortedCM[${#sortedCM[@]} -1]} $avg
$ IFS=$'\n' sortedCF=($(sort -n <<<"${CF[*]}"))
$ IFS='+' avg=$(echo "scale=4;(${CF[*]})/${#CF[@]}"|bc)
$ printf "Female Controls:\nMin: %.3f\nMax: %.3f\nAve: %.3f\n" ${sortedCF[0]} ${sortedCF[${#sortedCF[@]} -1]} $avg
$ IFS=$'\n' sortedPM=($(sort -n <<<"${PM[*]}"))
$ IFS='+' avg=$(echo "scale=4;(${PM[*]})/${#PM[@]}"|bc)
$ printf "Male Patients:\nMin: %.3f\nMax: %.3f\nAve: %.3f\n" ${sortedPM[0]} ${sortedPM[${#sortedPM[@]} -1]} $avg
$ IFS=$'\n' sortedPF=($(sort -n <<<"${PF[*]}"))
$ IFS='+' avg=$(echo "scale=4;(${PF[*]})/${#PF[@]}"|bc)
$ printf "Female Patients:\nMin: %.3f\nMax: %.3f\nAve: %.3f\n" ${sortedPF[0]} ${sortedPF[${#sortedPF[@]} -1]} $avg
$ unset IFS   # go back to the default separators (space, tab, newline)
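The sort, bc and printf lines are repeated once per array, which makes it easy to mislabel a group when copying them. As one possible alternative, the sketch below wraps the calculation in a hypothetical helper called print_stats, which takes a label followed by a list of values. It is not part of the original code: it relies on a shell function and on awk, neither of which has been introduced at this point, and awk stands in here for the sort and bc steps.

```shell
#!/usr/bin/env bash
# print_stats LABEL VALUE...
# Hypothetical helper: prints the minimum, maximum and average of the
# values with three decimals, in the same layout as the printf lines above.
print_stats() {
    local label=$1
    shift
    [ $# -gt 0 ] || return 0    # nothing to report for an empty group
    # Send the values to awk one per line; awk keeps track of the
    # smallest value, the largest value and the running sum.
    printf '%s\n' "$@" | awk -v label="$label" '
        NR == 1 || $1 + 0 < min { min = $1 + 0 }
        NR == 1 || $1 + 0 > max { max = $1 + 0 }
        { sum += $1 }
        END { printf "%s:\nMin: %.3f\nMax: %.3f\nAve: %.3f\n", label, min, max, sum / NR }
    '
}

# Movement values of the four male controls in the table above.
print_stats "Male Controls" 0.23525 0.19789 0.18789 0.13465
# prints:
# Male Controls:
# Min: 0.135
# Max: 0.235
# Ave: 0.189
```

With the arrays built by the loop, the same helper could be called as print_stats "Male Controls" "${CM[@]}", and likewise for CF, PM and PF. Because it never touches IFS, there is nothing to restore afterwards.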
In the previous example we read a file line by line using a for loop and the cat utility. This works most of the time. However, if one or more lines of the file contain a space, bash will treat each word separated by a space as a separate line.
For example, if file test.txt has the following content:
a b
c d
e f
g h
i j
When you try to read each line using a for loop, this is the result you will get:
$ i=0
$ for line in $(cat test.txt)
> do
> echo $((i++)) $line
> done
0 a
1 b
2 c
3 d
4 e
5 f
6 g
7 h
8 i
9 j
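To fix this problem you have to tell bash that newline (\n) is the only separator. You do this by setting the shell variable IFS (internal field separator) to $'\n'. The new value stays in effect for the rest of the session, until you change it again or run unset IFS.
$ IFS=$'\n'
$ i=0
$ for line in $(cat test.txt)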
> do
> echo $((i++)) $line
> done
0 a b
1 c d
2 e f
3 g h
4 i j
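A different way to get whole lines, shown here only as a sketch, is a while loop driven by the read builtin instead of for and cat. In this form IFS is changed only for the read command itself, so the rest of the session keeps its default separators. To keep the example self-contained it first writes the first three lines of test.txt to a temporary file; the variable name tmpfile is an assumption made for the illustration.

```shell
#!/usr/bin/env bash
# Create a small temporary file with the first three lines of test.txt.
tmpfile=$(mktemp)
printf 'a b\nc d\ne f\n' > "$tmpfile"

i=0
# IFS= keeps leading and trailing spaces of each line,
# and -r stops read from interpreting backslashes.
while IFS= read -r line
do
    echo "$i $line"
    i=$((i + 1))
done < "$tmpfile"
# prints:
# 0 a b
# 1 c d
# 2 e f

rm -f "$tmpfile"
```

The loop reads from the file through the redirection after done, so cat is not needed.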
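Load the content of a file into an array and access a specific line separately
With IFS still set to newline, as in the previous example, each line of the file becomes one element of the array: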
$ ARRAY=($(cat test.txt))
$ echo ${ARRAY[0]}
a b
$ echo ${ARRAY[1]}
c d
$ echo ${ARRAY[2]}
e f
$ echo ${ARRAY[3]}
g h
$ echo ${ARRAY[4]}
i j
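Loading a file this way depends on the value of IFS at the moment the array is created. The sketch below makes that dependency explicit by saving the current value, switching to newline only for the assignment, and restoring it afterwards. The names tmpfile and old_ifs are hypothetical, and the temporary file holds the first three lines of test.txt.

```shell
#!/usr/bin/env bash
# Create a small temporary file with the first three lines of test.txt.
tmpfile=$(mktemp)
printf 'a b\nc d\ne f\n' > "$tmpfile"

old_ifs=$IFS                 # remember the current separators (assumes IFS is set)
IFS=$'\n'                    # split the output of cat on newlines only
ARRAY=($(cat "$tmpfile"))    # one array element per line
IFS=$old_ifs                 # restore the previous separators

echo "Number of lines: ${#ARRAY[@]}"    # prints Number of lines: 3
echo "Second line: ${ARRAY[1]}"         # prints Second line: c d

rm -f "$tmpfile"
```

Note that the unquoted $(cat ...) is still subject to filename expansion, so a line consisting of a pattern such as * would be replaced by matching file names.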