doc regarding PDF inputs

jessp01 · jessp01 · commit f29668519465 · 2025-01-07T00:46:35.000Z
diff --git a/README.md b/README.md
@@ -55,6 +55,10 @@ If you take this route, you'll need to copy the `highlight/syntax_files` and `ut
 ### Installing `super-zaje`
 
 `super-zaje` does everything `zaje` does but provides the additional functionality of extracting text from an image. 
+
+**NOTE**: `zaje` can detect which lexer to use from the first line of text, but with images you'll often
+need to help it by specifying a lexer with `-l $NAME` (e.g. `zaje -l sh` or `zaje -l server-log`).
+
 It's a separate binary because it depends on the [gosseract](https://github.com/otiai10/gosseract) which in turn
 depends on `libtesseract` and requires its SOs to be available on the machine.
 
@@ -91,8 +95,17 @@ For example, try:
 $ ~/go/bin/super-zaje "https://github.com/jessp01/zaje/blob/master/testimg/go1.png?raw=true"
 ```
 
-**NOTE**: `zaje` is capable of detecting the lexer to use based on the first line of text but with images, you'll often
-need to help it and specify a designated lexer by passing `-l $NAME` (e.g: `zaje -l sh`, `zaje -l server-log`, etc).
+### PDF inputs
+
+PDF files are also supported. For example:
+
+```sh
+$ super-zaje --pdf --pdf-page-number 63 FORTRAN_colouring_book.pdf
+```
+
+This converts page **64** to a PNG (page numbers start from 0 in [go-fitz](https://github.com/gen2brain/go-fitz), which
+`super-zaje` uses) and passes it to [gosseract](https://github.com/otiai10/gosseract) for text extraction.
+
 
 
 ### ASCIInema screencast (Not a video!)
diff --git a/cmd/super-zaje/README.md b/cmd/super-zaje/README.md
@@ -9,7 +9,11 @@
 ### Installing `super-zaje`
 
 `super-zaje` does everything `zaje` does but provides the additional functionality of extracting text from an image. 
-It's a separate binary because it depends on the [gosseract](https://github.com/otiai10/gosseract) which in turn
+
+**NOTE**: `zaje` can detect which lexer to use from the first line of text, but with images you'll often
+need to help it by specifying a lexer with `-l $NAME` (e.g. `zaje -l sh` or `zaje -l server-log`).
+
+`super-zaje` is a separate binary because it depends on [gosseract](https://github.com/otiai10/gosseract), which in turn
 depends on `libtesseract` and requires its SOs to be available on the machine.
 
 First, install `zaje` using [install_zaje.sh](https://github.com/jessp01/zaje/blob/master/install_zaje.sh), and then...
@@ -45,8 +49,16 @@ For example, try:
 $ ~/go/bin/super-zaje "https://github.com/jessp01/zaje/blob/master/testimg/go1.png?raw=true"
 ```
 
-**NOTE**: `zaje` is capable of detecting the lexer to use based on the first line of text but with images, you'll often
-need to help it and specify a designated lexer by passing `-l $NAME` (e.g: `zaje -l sh`, `zaje -l server-log`, etc).
+### PDF inputs
+
+PDF files are also supported. For example:
+
+```sh
+$ super-zaje --pdf --pdf-page-number 63 FORTRAN_colouring_book.pdf
+```
+
+This converts page **64** to a PNG (page numbers start from 0 in [go-fitz](https://github.com/gen2brain/go-fitz), which
+`super-zaje` uses) and passes it to [gosseract](https://github.com/otiai10/gosseract) for text extraction.
 
 
 ```yml
@@ -74,6 +86,10 @@ GLOBAL OPTIONS:
 
    --remove-line-numbers, --rln  Remove line numbers.
 
+   --pdf  Pass if input is a PDF file.
+
+   --pdf-page-number value, --pn value  When working on a PDF, set the page to process (first page is 0, not 1).
+
    --help, -h  show help
 
    --print-version, -V  print only the version