I’ve been doing some studying with Ruby and today decided to take it for a spin to solve a data-formatting mini-project I had at work.

We have a ton of data, thousands of lines, all from 2023, and I narrowed it down to only one month using some grep regular expression.

cat filename.csv | grep -E "01/[0-9][0-9]/2023"  

This basically searches the file and, in this example, will find anything that has 01/numbers/2023 in it. Those lines are kept and everything else is ignored.

So separating out a single month from this huge file wasn’t bad. And using RegEx to quickly grab data is really handy. The problem of sorting through what remains…remains.
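
(As a quick aside, the same month filter could just as easily live in Ruby. A minimal sketch, assuming the same MM/DD/YYYY date format and a placeholder filename:)

### Ruby
# Keep only the lines that mention January 2023, same idea as the grep above
File.foreach('filename.csv') do |line|
  puts line if line.match?(%r{01/\d\d/2023})
end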

The lines are comma-separated columns. It’s something like:

[version],[branch],[DATE],[Success]
[version],[branch],[DATE],[Fail],[User who fixed it]

I needed to get all the lines where the run failed and a user fixed it. Then I needed to count how many times each user fixed problems, i.e. how many lines they show up in.

DangerMouse   44
VeggieBob     2

Something like that.
I’m thinking Linux tools.
awk + sed can probably get me there, but it’s going to be crazy messy and highly illegible.

I tried to use ChatGPT to write me one-liners to awk the file for what I need: basically, search for lines with a 5th column. I was using NF and printing the line if the column count was >= a number, but it just wouldn’t work. It kept failing and it wasn’t clear why.

To get out of that frustration I got ChatGPT to help me write a Ruby script, and we kept refining it till I finally got what I needed.

Ruby

require 'csv'

filename = 'your_file.csv'

CSV.foreach(filename) do |row|
  # Only keep rows that have a non-empty 5th column (the user who fixed it)
  if row[4] && !row[4].empty?
    puts row.join(',')
  end
end

That took the January CSV data I had and only displayed the lines with that 5th column: failed runs with users that fixed them. It checks that with row[4] because programming languages start counting columns at 0, so the 5th column is index 4. And in Ruby, print is called puts, so it prints the data.
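
To make the indexing concrete, here’s what one parsed row might look like (the values are made up):

### Ruby
row = ["1.0.3", "release", "01/15/2023", "Fail", "DangerMouse"]  # hypothetical sample row
row[0]  # => "1.0.3"        (1st column)
row[4]  # => "DangerMouse"  (5th column, the user who fixed it)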

Ok not bad.
Now I need to sort that list by name, and then do the real tricky part, which is getting the count for each name. Again, I’m thinking of doing that with sed/awk, but I’m just not sure. It’s a bit of mental gymnastics to even imagine how that would work. It’s much easier to use a programming language to do it.

“If you’re writing an array in BASH, you’ve gone too far.”
- ThePrimeagen

Starting to enter the territory of those wise words.

Ok, so let’s add that sorting and counting functionality to the Ruby script.

Ruby

require 'csv'

filename = 'your_file.csv'

rows = []   ## Adding an array to hold the matching rows.
counts = Hash.new(0)  ## Creates a new Hash (default value 0) to store the counts of each name.

CSV.foreach(filename) do |row|
  if row[4] && !row[4].empty?
    rows << row  ## now each row it finds that matches the pattern gets added to the array.
    counts[row[4]] += 1  ## Checks the name in the 5th column and adds it to our hash count
  end
end

# sort rows by the 5th column
sorted_rows = rows.sort_by { |row| row[4] }

# print sorted rows
sorted_rows.each do |row|
  puts row.join(',')
end

# Print name counts
puts "\nUser\t\tCount"
counts.sort.each do |name, count|
  puts "#{name}\t\t#{count}"
end

This worked out great: it printed everything I needed, crazy fast, and exactly how I wanted it.

I feel like if I had wanted to do this the Linux route it would have taken me a while to google all the unique sed and awk commands.

I’ve also run into issues where I’m on old systems, the installed sed and awk are older versions than what most people online are using, and features/functionality go missing in that chasm. Not normally a common issue, except if you work at a bank. 🙂
Then it becomes a real thing, and honestly I just want my data. I don’t want to learn the obscure way that will only work on that one server at work because it’s stuck on an old version of RHEL.

So we’re done… right?

Well, the next thing I wondered is: how would this same script look in Python and Go? And maybe even BASH? My theory is that there would be a gradual (or severe) increase in illegibility & complexity between these languages.

Let’s see.

Python

import csv
from collections import defaultdict

filename = 'your_file.csv'

rows = []
counts = defaultdict(int)  ## Python uses a dictionary vs Ruby's hash. 

# Open the file and find rows with a non-empty user column
with open(filename, 'r') as f:
    reader = csv.reader(f)
    for row in reader:
        if len(row) > 4 and row[4]:  # success rows only have 4 columns
            rows.append(row)
            counts[row[4]] += 1

# Sort rows by 5th column
sorted_rows = sorted(rows, key=lambda row: row[4])

# Print sorted rows
for row in sorted_rows:
    print(','.join(row))

# Print name counts
print("\nUser\t\tCount")
for name, count in sorted(counts.items()):
    print(f"{name}\t\t{count}")

Python appears very similar to Ruby. Surprisingly so. The big difference I’m seeing is the file handling process. I really under-appreciated that the CSV module in Ruby just handled that for me behind the scenes. (For what it’s worth, Python’s with block does close the file automatically when it exits, so that part is covered.) I’m sure this code can still be improved; it’s likely not best practice code, these are ChatGPT translations so they aren’t high quality, but they work. 🙂
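
Out of curiosity, here’s very roughly what CSV.foreach is doing for me behind the scenes in Ruby (a sketch, ignoring options and edge cases):

### Ruby
require 'csv'

filename = 'your_file.csv'

# Roughly what CSV.foreach(filename) does for you:
File.open(filename, 'r') do |f|
  CSV.new(f).each do |row|
    # ... work with each parsed row ...
  end
end  # the file is closed automatically when the block exits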

Python’s readability is good, but it’s definitely more “programmy”. Comparing these two lines between Python and Ruby is night and day to my eyes.

### Python
  sorted_rows = sorted(rows, key=lambda row: row[4])
### Ruby
  sorted_rows = rows.sort_by { |row| row[4] }

Python gets a little intense inside those parentheses 🙂 🙂. Some programming languages, I feel, are super rigid and structured, and my callow mind can’t always remember the proper structure for each step. It’s basic, it’s just (Your_object, Key_to_sort_by), and both languages require the same thing, but the way each does it is so different.

It’s like comparing name brand mouth wash to store brand. It’s the same thing … sure in theory and maybe when comparing ingredients on the label, but the difference is pretty clear when using them.

For someone starting out programming these small differences can be very off putting. When you’re trying to wrap your head around structure and logic, dealing with file handling to read or write a file hurts.

Anyway, some things in Python read very naturally, while other things are obtuse.

### Python
for name, count in sorted(counts.items()):
    print(f"{name}\t\t{count}")

### Ruby
counts.sort.each do |name, count|
  puts "#{name}\t\t#{count}"
end

Ruby feels like it reads a little bit easier, but in theory, if I can get good at reading the Pythonic way, then moving to other languages may be much easier. Python has things written more in computer sentence structure, while Ruby is more like human sentence structure.

I also want to predict that Python will be the winner as complexity and quantity of data scales upward. But for these kinds of scripts Ruby is so far my pick.

GoLang

package main
import (
 "encoding/csv"
 "fmt"
 "os"
 "sort"
)

func main() {
 filename := "your_file.csv"

 file, err := os.Open(filename)
 if err != nil {
  panic(err)
 }
 defer file.Close()

 reader := csv.NewReader(file)
 // The data has both 4- and 5-column rows, so don't enforce a fixed field count
 reader.FieldsPerRecord = -1

 records, err := reader.ReadAll()
 if err != nil {
  panic(err)
 }

 var rows [][]string
 counts := make(map[string]int)

 for _, row := range records {
  if row[4] != "" {
   rows = append(rows, row)
   counts[row[4]]++
  }
 }

 // Sort rows by 5th column
 sort.Slice(rows, func(i, j int) bool {
  return rows[i][4] < rows[j][4]
 })

 // Print sorted rows
 for _, row := range rows {
  fmt.Println(row)
 }

 // Print name counts
 fmt.Println("\nUser\t\tCount")
 var names []string
 for name := range counts {
  names = append(names, name)
 }
 sort.Strings(names)
 for _, name := range names {
  fmt.Printf("%s\t\t%d\n", name, counts[name])
 }
}

🤯

Golang is a whole other animal. I do find the explicit error handling comforting there. I’m not sure how much Ruby does under the hood; I know in Python it needs to be added, similar to how Golang has it.
It is pretty readable overall, but it’s clearly a step up in “programmy” writing. The logical mind is much more dominant here.

In the for loops we see _, which is a way to make things more readable. When Go ranges over a slice (wow), it returns two values (index, value). Since we don’t need the index, the _ says: leave it blank, dude.

:= also adds to legibility by declaring and initializing a variable in one shot; you can see it in the for loops where we declare the name variable, for example. It definitely makes things more legible, but you need to know what := is all about before you read the code.

The other thing that Golang has going for it, in a way, is that it needs to be compiled. Before you run that script you need to run “go build process_data.go”, then you can run the executable. That also means that we now have an executable, and when compiling we get some extra logic checks that hopefully protect us from runtime errors.

### GoLang
counts := make(map[string]int)

### Python
counts = defaultdict(int) 

### Ruby
counts = Hash.new(0)

This highlights that little extra complexity in GoLang. I know that complexity means more control and more power to do interesting things. But starting out I want training wheels, not a 10-speed mountain bike with electric assist; I wouldn’t know where to begin. Python is still killing it with its Huffy do-it-all aesthetic.
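
For what it’s worth, the reason all three of those lines reach for a container that defaults to zero is so the very first increment for a name doesn’t blow up. A quick Ruby illustration (the name is made up):

### Ruby
plain = {}
plain["DangerMouse"] += 1    # NoMethodError: nil has no + method, there's no default value

counts = Hash.new(0)         # missing keys default to 0
counts["DangerMouse"] += 1   # works, counts["DangerMouse"] is now 1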

BASH

#!/bin/bash

filename="your_file.csv"

# Extract rows with non-empty 5th column
rows=$(awk -F',' '$5!="" {print $0}' "$filename" | sort -t',' -k5)

# Print sorted rows
echo "$rows"

# Count names in 5th column and sort by name
echo -e "\nUser\t\tCount"
echo "$rows" | cut -d',' -f5 | sort | uniq -c | awk '{print $2 "\t\t" $1}'

Wait, how did Bash just win this challenge by a mile?
It’s illegible (for the most part), but it’s about 1/3 the size of the other scripts. Again, as I mentioned, awk wasn’t working for me when running things one-off on the command line, but when I put it in the script, it worked!
And it felt blazing fast.

Frankly I’m amazed. I sometimes feel guilty for using Bash and not doing everything “programmatically”. There’s absolutely pressure from management for us all to move to using programming languages. But honestly, the speed and brevity of that script are awesome. The only extra step I needed to take to make this work was to run ‘chmod +x process_data.sh’ to make it executable.

Using the time command in Linux, I’m seeing:

Ruby: 0.890s  - 28 Lines of Code
Python: 0.351s - 26 Lines of Code
GoLang: 0.412s - 56 Lines of Code
Bash: 0.423s  - 13 Lines of Code.

Interesting results, as Bash felt the fastest: when the script runs you’re instantly placed at the end of the results, while with the other scripts you briefly see them run through the lines of text. It’s a small difference, but psychologically it led me to feel that Bash had destroyed the whole group when it comes to speed.

Ruby was far and away the slowest, and surprisingly (or not) Python is amazing at processing data. It’s kind of what I expected: Python will totally scale better with more complexity and more data.

This is not a fair comparison by any stretch. It’s more about learning all the languages and their differences, their strengths, and maybe putting my mind at ease about using BASH so much. It’s not bad if you can remember all the details.

The other languages made things much more logical in the way they broke down the problem, and much more readable.

Ruby might be slower to run, but that .4s “slowness” overlooks the time and effort to write a working script from scratch. With Ruby I think I’ll be able to conceptualize and write it myself.
Hopefully as I continue in Ruby, it will be my Rosetta Stone into programming. Then I can transition to learning more harshly logical and “programmy” languages.