11 Unix Terminal Commands for Data Analysis

4 minute read

It is easy to read a data file into R or Python and run a summary in R or describe in Python pandas to understand its content. There are times, though, when one may encounter data that is either very large and takes a long time to read, or not quite formatted into a clean set of columns. At times like these, unix terminal commands come in handy perform a few elementary checks and computations on the data. Below are 11 operations that can be done on the terminal that you may find useful. We will use the popular IRIS dataset as an example.

1. Download a file

curl http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data > iris.data

We use the curl command to download the data from a url. The downloaded file is saved as iris.data

Print the first 5 lines using head

head -n 5 iris.data 

The output is:

5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa

Print the last 7 lines using tail

tail -n 7 iris.data

The output is

6.7,3.3,5.7,2.5,Iris-virginica
6.7,3.0,5.2,2.3,Iris-virginica
6.3,2.5,5.0,1.9,Iris-virginica
6.5,3.0,5.2,2.0,Iris-virginica
6.2,3.4,5.4,2.3,Iris-virginica
5.9,3.0,5.1,1.8,Iris-virginica


You can extract intermediate lines (say lines 41-43) using a combination of head and tail as below

head -n 43 iris.data | tail -n 3 

This will output

5.0,3.5,1.3,0.3,Iris-setosa
4.5,2.3,1.3,0.3,Iris-setosa
4.4,3.2,1.3,0.2,Iris-setosa

Note that the data has 5 columns representing sepal length (cm), sepal width (cm), petal length (cm), petal width (cm), and type of flower

You may have noticed from the tail command earlier that there is a single blank line at the end of the file. We will remove this using the commands below:

sed '$d' iris.data >iris_clean.dat
mv iris_clean.dat iris.data 

The $d is taken by sed to be the last line of the file. We overwrite the original file with its clean version. You can check that this is the case by re-running the tail command.

3. Number of rows

wc -l iris.data

The output indicates that the file has 150 rows

150 iris.data

4. File size

 du -h iris.data 

The -h option outputs the file size in a human readable format

8.0K	iris.data

5. Extract a column

Using the cut command, we extract the 5th column that contains the iris types.

cut -d, -f5 iris.data | head -n 3

The -d option allows you to specify the separator in the file (in this case a comma), and -f lets you select a column (column 5). The | is the unix pipe command that pipes the output of one command over to the next one, in this case head that is used to see the first few lines of the output

Iris-setosa
Iris-setosa
Iris-setosa

The cut command can be used to select multiple columns or range of columns by column number. This is useful when you are working with a wide file and want to extract only the columns of interest

cut -d, -f4,5 iris.data 
cut -d, -f3-5 iris.data 
cut -d, -f1,5 iris.data 
cut -d, -f1,5 iris.data

An alternative to cut is awk - a programming language designed for text processing.

awk -F, '{print $5}' iris.data 
awk -F, '{print $4","$5}' iris.data

6. Counts

Once we extract a column using cut above, we use sort and uniq to print the counts for all unique values in the column

cat iris.data | cut -d, -f5 | sort | uniq -c

For the last column that contains the iris types, we get the following output

  50 Iris-setosa
  50 Iris-versicolor
  50 Iris-virginica

7. Filter

We filter to extract rows that contain a particular column value using awk. Below, we extract rows that have the value “Iris-versicolor” in column 5

awk -F, '$5=="Iris-versicolor" { print }'  iris.data| head -n 3

The first 3 rows of the output are displayed below

7.0,3.2,4.7,1.4,Iris-versicolor
6.4,3.2,4.5,1.5,Iris-versicolor
6.9,3.1,4.9,1.5,Iris-versicolor

We can filter on multiple column values as below, where we additionally filter to keep rows that have a value of 3.2 in column 2 that represents sepal width

awk -F, '$5=="Iris-versicolor" && $2==3.2 { print }'  iris.data

This gives the output below

7.0,3.2,4.7,1.4,Iris-versicolor
6.4,3.2,4.5,1.5,Iris-versicolor
5.9,3.2,4.8,1.8,Iris-versicolor

8. Column sum

We calculate the sum of the first column using awk as below

awk -F, '{sum += $1;} END { print sum}'  iris.data

9. Average of a column with filtering

We calculate the average value of the first column (sepal length) for flowers of the type “Iris-veriscolor”

awk -F, '$5=="Iris-versicolor" {sum += $1; count += 1} END { print sum/count}'  iris.data

This outputs 5.936

9. Average of multiple columns with filtering

We calculate the averages for each of the 4 columns by extending the approach above

awk -F, '$5=="Iris-versicolor" {sum1 += $1;sum2 += $2; sum3 += $3; sum4 += $4; count += 1} END {print "For Iris-versicolor","\n avg. sepal length= ",sum1/count, "\n avg. sepal width =",sum2/count,"\n avg. petal length=",sum3/count, "\n avg. petal width=",sum4/count }' iris.data

The output is:

For Iris-versicolor 
 avg. sepal length=  5.936 
 avg. sepal width = 2.77 
 avg. petal length= 4.26 
 avg. petal width= 1.326

10. Data Aggregation

We are now ready to do a simple aggregation - print the count and average values of columns 1 to 4 by each unique label in column 5 (flower type)

awk -F, '{a[$5]++;b[$5]+=$1;c[$5]+=$2;d[$5]+=$3;e[$5]+=$4} END {for(i in a) print i,a[i],b[i]/a[i],c[i]/a[i],d[i]/a[i],e[i]/a[i]}' iris.data

This will output

Iris-setosa 50 5.006 3.418 1.464 0.244
Iris-versicolor 50 5.936 2.77 4.26 1.326
Iris-virginica 50 6.588 2.974 5.552 2.026

11. Sampling data

For sampling a subset of the data (10% in the example below) you can do the following

awk -F, 'BEGIN {srand()} { if (rand() <= .1) print $0}' iris.data > iris.data.sub

The sampling is approximate (i.e. you won’t have exactly 10% ). For very large data files, this allows you to extract a sample small enough to be read into and analyzed in a higher level language like R or Python.

Updated:

Leave a Comment