
🌀 CSC510: Software Engineering
NC State, Spring '25

Our notes here come from the Software Carpentry notes and the amazing MIT Missing Semester course.

As such, we share their copyright: Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International.

Introduction

Make is a tool which can run commands to read files, process these files in some way, and write out the processed files. For example, in software development, Make is used to compile source code into executable programs or libraries, but Make can also be used to:

  • run analysis scripts on raw data files to get data files that summarize the raw data;
  • run visualization scripts on data files to produce plots;
  • parse and combine text files and plots to create papers; and
  • record, in an executable form, all your shortcuts.

Make is called a build tool - it builds data files, plots, papers, programs or libraries. It can also update existing files if desired.

Make tracks the dependencies between the files it creates and the files used to create these. If one of the original files (e.g. a data file) is changed, then Make knows to recreate, or update, the files that depend upon this file (e.g. a plot).

Make was first released in April 1977 by Stuart Feldman at Bell Labs. That makes it nearly 50 years old, but it is still widely used because:

  • Ubiquity & Stability – Make is deeply embedded in Unix-based systems, well-documented, and universally supported.
  • Minimal Overhead – Newer tools (CMake, Ninja, Bazel) can be heavier, requiring extra dependencies or complex setup.
  • Declarative & Efficient – It tracks dependencies automatically, only rebuilding what's necessary, without overengineering.
  • Works Everywhere – Unlike newer tools, Make runs on virtually any system without special installation or configuration.

Make was motivated by the need for a tool to automate software compilation dependencies, a problem that was becoming more complex as projects grew. The Make work laid the foundation for modern build automation and dependency management.

That said, if you're dealing with massive parallel builds, cross-platform projects, or non-C/C++ languages, newer tools might be a better fit, e.g.:

  • CMake – Meta-build system that generates Makefiles or Ninja files, common for C/C++.
  • Ninja – Fast, minimal build system, often used with CMake.
  • Meson – Simpler, faster alternative to CMake, optimized for performance.
  • Bazel – Scalable build tool from Google, great for large, multi-language projects.
  • Gradle (Java/Kotlin) – Powerful build automation, used for Android development.
  • Cargo (Rust) – Rust’s built-in build system and package manager.
  • Maven (Java) – Dependency management and build tool for Java applications.
  • SBT (Scala/Java) – Optimized for Scala, supports parallel execution.
  • Just – Modern, simple task runner, like Make but with cleaner syntax.
  • Taskfile – YAML-based task runner, popular in the Go ecosystem.
  • Rake (Ruby) – Ruby’s Make-like tool for automation and builds.
  • Leiningen (Clojure) – manages dependencies and tasks.

Key Makefile Concepts

A Makefile consists of rules, targets, dependencies, and commands. The basic structure is:

target: dependencies
	command
  • target: The output file or action.
  • dependencies: Files required to produce the target.
  • command: The shell command to generate the target.

By default, Make reads a file called Makefile in the current directory and runs its commands (but this can be changed using make -f file.mk).

Example

updata.txt: data.txt
	echo "uping..."
	gawk '{print toupper($$0)}' data.txt > updata.txt

Note that all command lines must start with a tab. Also, $$ is how Make escapes a literal $ (so $$0 is the gawk variable $0, whereas $(SHELL) is a Make variable).

Run with:

make updata.txt

In this example, the text

Makefiles are traditionally used for compiling code, but they can
also be powerful tools for automating data-driven workflows. This
tutorial focuses on using make alongside gawk, grep, and sed to
process data efficiently.

is processed and saved to updata.txt as

MAKEFILES ARE TRADITIONALLY USED FOR COMPILING CODE, BUT THEY CAN
ALSO BE POWERFUL TOOLS FOR AUTOMATING DATA-DRIVEN WORKFLOWS. THIS
TUTORIAL FOCUSES ON USING MAKE ALONGSIDE GAWK, GREP, AND SED TO
PROCESS DATA EFFICIENTLY.

Note that the commands are only run if the dependencies are newer than the target.

  • So if we run this twice, we only see the echo once.
$ make -s updata.txt
uping...


$ make -s  updata.txt
$

Note that make -s silences verbose output. Don't use it during debugging (when you want to see those errors).
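
To see the rule fire again, update the dependency's timestamp; a quick sketch (touch just marks data.txt as newer than updata.txt):

$ touch data.txt
$ make -s updata.txt
uping...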

Automatic Variables

Make provides several automatic variables that simplify rule definitions:

  • $@ - The target filename.
  • $< - The first prerequisite (dependency).
  • $^ - All prerequisites.

Example:

%.uppercase: %
	gawk '{print toupper($$0)}' $< > $@

Run:

make file.txt.uppercase
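
For $^ (all prerequisites), a minimal sketch might look like this (combined.txt, part1.txt, and part2.txt are just placeholder names):

combined.txt: part1.txt part2.txt
	cat $^ > $@    # $^ expands to "part1.txt part2.txt", $@ to "combined.txt"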

Wildcard and Pattern Matching

DATAFILES := $(wildcard dataset/*.txt)
PROCESSED := $(patsubst dataset/%, processed/%, $(DATAFILES))


processed/%.txt: dataset/%.txt
	gawk '{print toupper($$0)}' $< > $@
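
The PROCESSED variable can then drive a batch target; a minimal sketch (the all target name is an assumption, and it presumes the processed/ directory already exists):

all: $(PROCESSED)    # "make all" rebuilds every processed/*.txt that is out of date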

Functions in Makefiles

EXT = txt
FILES = $(wildcard *.$(EXT))


print_files:
	@echo $(FILES)
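
Other text functions compose the same way; a small sketch (the variable names here are just for illustration):

SRCS    = $(wildcard *.txt)
BACKUPS = $(addsuffix .bak, $(SRCS))      # a.txt b.txt -> a.txt.bak b.txt.bak
SMALL   = $(filter-out big%, $(SRCS))     # drop any file whose name starts with "big"

show:
	@echo $(BACKUPS)
	@echo $(SMALL)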

Self-Documenting Makefiles

.PHONY: help
help : Makefile ## print help
	echo ""; printf "usage: make [OPTIONS]\n\n"
	@gawk 'BEGIN {FS="[ \t]*:.*##[ \t]*"}  \
	  NF==2 { printf \
           "  \033[36m%-25s\033[0m %s\n","make " $$1,$$2}'  $< \
	| grep -v awk

@ : run a command without echoing it. \ : used to continue the previous line.

  • Pro-tip: each line of a recipe runs in a separate shell environment (unless you continue them with \).

Run:

make help
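
Any other rule annotated with a trailing ## comment gets picked up by that help target; for example (the test target and its description here are assumptions, purely for illustration):

test: ## run the unit tests
	@echo "running tests..."

make help would then list make test next to the other annotated targets.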

Shortcuts

LOUD = \033[1;34m# ANSI bold blue; the trailing "#" stops Make adding a trailing space to the value
SOFT = \033[0m# ANSI reset


# The leading "-" tells Make to keep going even if a command fails.
push: ## commit to main
	- echo -en "$(LOUD)Why this push? $(SOFT)" ;  read x ; git commit -am "$$x" ;  git push
	- git status
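
Run with:

make push

It prompts "Why this push?", reads a one-line commit message, then commits and pushes.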

Reproducibility

exp1: ../../moot/optimize/[chmp]*/*.csv
	$(foreach f, $^, (lua kah.lua --comparez $f & ); )


experiment1:; $(MAKE) exp1 | tee ~/tmp/$@.out
  • Output goes to ~/tmp/experiment1.out (and gets copied to the screen along the way).
  • Note the dependencies with wildcards.
  • Note the run-in-background command "&".
    • These four lines set up dozens of parallel tasks. Brings your computer to its knees (for a while).
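
The same foreach-over-prerequisites idiom on a smaller scale (the file names and the echo command are placeholders, just to show the shape):

demo: a.csv b.csv c.csv
	$(foreach f, $^, (echo "processing $f" & ); )

Each prerequisite gets its own backgrounded subshell, which is where all that parallelism comes from.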

Data Processing with gawk, grep, and sed

Extracting Data with grep

filter: dataset/data.txt
	grep "pattern" dataset/data.txt > filtered.txt

Processing with gawk

analyze: dataset/data.txt
	gawk '{sum += $$2} END {print "Total:", sum}' dataset/data.txt > summary.txt

Stream Editing with sed

transform: dataset/data.txt
	sed 's/old/new/g' dataset/data.txt > transformed.txt
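
These steps can also be chained in a single recipe; a minimal sketch (the pattern "error", the column $2, and the old/new substitution are placeholders):

report: dataset/data.txt
	grep "error" $< | gawk '{sum += $$2} END {print "Total:", sum}' | sed 's/Total/TOTAL/' > report.txt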

Homework: Text Processing with Make

What to Hand In

Submit a link to a directory in a repository containing:

  • All your data and scripts
  • The following output files:
    • step1.txt
    • step2.txt
    • step3.txt
    • step4.txt

Submitting Your Work

Create a PDF file containing:

  • Your repository link
  • Group number
  • Names of all participants (Format: Name - GitHub User ID)

📌 Important: Only one group member submits the work.

Assignment Details

The file available at data.txt contains approximately 50 paragraphs.

Your task is to:

  1. Clean the file: Remove punctuation and convert text to lowercase.
  2. Remove stop words: Eliminate common stop words from the text.
  3. Identify the ten most frequent words: Find the ten most used terms in the text.
  4. Generate a word frequency table: Create a new file with one line per paragraph, showing the frequency counts of the top ten terms in each paragraph.

For example, if the most frequent words are:

challenging,do,key,and,data,example,aspect,engineering,we,

and in the first paragraph challenging and aspect each appear once while data appears three times, then the output should look like:

challenging,do,key,and,data,example,aspect,engineering,we,
1,0,0,0,3,0,1,0,0,
0,1,1,2,1,1,0,0,1,
2,0,1,4,3,2,0,0,0,
0,0,1,1,1,1,2,0,3,
1,1,2,2,1,1,0,2,5,
3,2,1,1,4,0,0,1,1
...

Makefile Instructions

You are provided with a Makefile to automate the process. You need to write killstopXXX.awk, a magic piece of scripting at YYY, and ZZZ.awk.

Makefile Code:

# Define variables
INPUT = data.txt
CLEANED = cleaned.txt
STOPPED = stop.txt
TOKENS = tokens.txt
FREQS = word_counts.txt
TOP_WORDS = top.txt
TABLE = table.txt

# Run the full pipeline
all: $(TABLE)

# Step 1: Canonicalization (remove punctuation, convert to lowercase)
$(CLEANED): $(INPUT)
	sed 's/[^a-zA-Z ]//g' $< | tr 'A-Z' 'a-z' > $@

# Step 2: Remove stop words
# Stop words are (is|the|in|but|can|a|the|is|in|of|to|a|that|it|for|on|with|as|this|was|at|by|an|be|from|or|are)
$(STOPPED) : $(CLEANED)
	gawk -f killstopXXX.awk $< > $@

# Step 3: Report frequency of words
$(FREQS): $(STOPPED)
	cat $< | YYY | sort -nr > $@

# Step 4: Extract Top 10 most frequent words
$(TOP_WORDS): $(FREQS)
	gawk '$$2 && NR <=10 {print $$2}' $< > $@

# Step 5: Generate table of word frequencies per paragraph
$(TABLE): $(CLEANED) $(TOP_WORDS)
	gawk -f ZZZ.awk PASS=1 $(TOP_WORDS) PASS=2 $(CLEANED) > $@

# Cleanup
clean:
	rm -f $(CLEANED) $(TOKENS) $(FREQS) $(TOP_WORDS) $(TABLE)

step1:
	$(MAKE) clean $(CLEANED); head $(CLEANED)

step2:
	$(MAKE) clean $(STOPPED); head $(STOPPED)

step3:
	$(MAKE) clean $(TOP_WORDS); head $(TOP_WORDS)

step4:
	$(MAKE) clean $(TABLE); head $(TABLE) # | column -s, -t

(Aside: your system may not have column, so that is commented out on the last line. But if you have column, then you get a nice pretty-print.)
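
If you do have column, you can pretty-print the final table directly; a quick sketch using the same flags as the commented-out code:

head table.txt | column -s, -t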

Makefile Tasks:

  • Step 1: Text Cleaning
    • Removes punctuation and converts text to lowercase.
    • Run using: make step1
  • Step 2: Stop Word Removal
    • Eliminates common stop words.
    • Run using: make step2
  • Step 3: Identify Frequent Words
    • Extracts the top ten most frequent words.
    • Run using: make step3
  • Step 4: Generate Word Frequency Table
    • Counts occurrences of top words in each paragraph.
    • Run using: make step4

Running Your Code

The beauty of all this is that you can test each piece. To reset everything and just generate the cleaned.txt file:

make step1

To test the final output:

make step4

Grading Scale

Total: 1 point.

  • Each task is worth 0.25 points.
  • Ensure all steps are completed to receive full credit.

QA

  • (Q1) When we print the list of most frequent words, do they have to be sorted in frequency order?

  • (A1) No, any order at all. But the key is to NOT print the less frequent words.

  • (Q2) What is this PASS=1 PASS=2 thing on the command line of the last gawk call?

  • (A2) In gawk, you can set variables on the command line and query them inside the script. E.g., the following asks gawk to run through two files, each time with a different setting of PASS:

gawk -f code.awk PASS=1 config.txt PASS=2 data.txt

then inside the code

PASS==1 { # do stuff with config.txt
}
PASS==2 { # do stuff with data.txt
}

Pro-tip: gawk's NR variable counts the TOTAL records seen so far, while FNR counts the records in the current file. Don't get them confused. So the following code would put the file contents into two different arrays, each starting at 1:

PASS==1 { config[FNR] = $0 }
PASS==2 { data[FNR] = $0 }

but this code would (mistakenly) start data at size of config + 1

PASS==1 { config[NR] = $0 }
PASS==2 { data[NR] = $0 }
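
A quick way to see the difference (a sketch; config.txt and data.txt are placeholder file names):

gawk '{ print PASS, FILENAME, FNR, NR }' PASS=1 config.txt PASS=2 data.txt

For the second file, FNR restarts at 1 while NR keeps counting.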