|
5 | 5 | <link href="https://learnbyexample.github.io/atom.xml" rel="self" type="application/atom+xml"/>
|
6 | 6 | <link href="https://learnbyexample.github.io"/>
|
7 | 7 | <generator uri="https://www.getzola.org/">Zola</generator>
|
8 |
| - <updated>2023-06-13T00:00:00+00:00</updated> |
| 8 | + <updated>2023-06-20T00:00:00+00:00</updated> |
9 | 9 | <id>https://learnbyexample.github.io/atom.xml</id>
|
| 10 | + <entry xml:lang="en"> |
| 11 | + <title>CLI tip 29: define fields using FPAT in GNU awk</title> |
| 12 | + <published>2023-06-20T00:00:00+00:00</published> |
| 13 | + <updated>2023-06-20T00:00:00+00:00</updated> |
| 14 | + <link rel="alternate" href="https://learnbyexample.github.io/tips/cli-tip-29/" type="text/html"/> |
| 15 | + <id>https://learnbyexample.github.io/tips/cli-tip-29/</id> |
| 16 | + <content type="html"><p>In <code>awk</code>, the <code>FS</code> variable allows you to define the input field <em>separator</em>. In contrast, <code>FPAT</code> (field pattern) allows you to define what the fields should be made up of.</p> |
| 17 | +<pre data-lang="ruby" style="background-color:#f5f5f5;color:#1f1f1f;" class="language-ruby "><code class="language-ruby" data-lang="ruby"><span>$ s=</span><span style="color:#d07711;">&#39;Sample123string42with777numbers&#39; |
| 18 | +</span><span style="color:#7f8989;"># one or more consecutive digits |
| 19 | +</span><span>$ echo </span><span style="color:#d07711;">&quot;$s&quot; </span><span style="color:#72ab00;">|</span><span> awk </span><span style="color:#72ab00;">-</span><span>v </span><span style="color:#c23f31;">FPAT</span><span style="color:#72ab00;">=</span><span style="color:#d07711;">&#39;[0-9]+&#39; &#39;{print $2}&#39; |
| 20 | +</span><span style="color:#b3933a;">42 |
| 21 | +</span><span> |
| 22 | +</span><span>$ s=</span><span style="color:#d07711;">&#39;coat Bin food tar12 best Apple fig_42&#39; |
| 23 | +</span><span style="color:#7f8989;"># whole words made up of lowercase letters and digits only |
| 24 | +</span><span>$ echo </span><span style="color:#d07711;">&quot;$s&quot; </span><span style="color:#72ab00;">|</span><span> awk </span><span style="color:#72ab00;">-</span><span>v </span><span style="color:#c23f31;">FPAT</span><span style="color:#72ab00;">=</span><span style="color:#d07711;">&#39;</span><span style="color:#aeb52b;">\\</span><span style="color:#d07711;">&lt;[a-z0-9]+</span><span style="color:#aeb52b;">\\</span><span style="color:#d07711;">&gt;&#39; </span><span style="color:#72ab00;">-</span><span>v </span><span style="color:#c23f31;">OFS</span><span style="color:#72ab00;">=</span><span>, </span><span style="color:#d07711;">&#39;{$1=$1} 1&#39; |
| 25 | +</span><span>coat,food,tar12,best |
| 26 | +</span><span> |
| 27 | +</span><span>$ s=</span><span style="color:#d07711;">&#39;items: &quot;apple&quot; and &quot;mango&quot;&#39; |
| 28 | +</span><span style="color:#7f8989;"># get the first double quoted item |
| 29 | +</span><span>$ echo </span><span style="color:#d07711;">&quot;$s&quot; </span><span style="color:#72ab00;">|</span><span> awk </span><span style="color:#72ab00;">-</span><span>v </span><span style="color:#c23f31;">FPAT</span><span style="color:#72ab00;">=</span><span style="color:#d07711;">&#39;&quot;[^&quot;]+&quot;&#39; &#39;{print $1}&#39; |
| 30 | +</span><span style="color:#d07711;">&quot;apple&quot; |
| 31 | +</span></code></pre> |
| 32 | +<p><code>FPAT</code> is often used for CSV input where fields can contain embedded delimiter characters. For example, the field <code>&quot;fox,42&quot;</code> contains a comma even though <code>,</code> is the delimiter.</p> |
| 33 | +<pre data-lang="ruby" style="background-color:#f5f5f5;color:#1f1f1f;" class="language-ruby "><code class="language-ruby" data-lang="ruby"><span>$ s=</span><span style="color:#d07711;">&#39;eagle,&quot;fox,42&quot;,bee,frog&#39; |
| 34 | +</span><span> |
| 35 | +</span><span style="color:#7f8989;"># simply using , as separator isn&#39;t sufficient |
| 36 | +</span><span>$ echo </span><span style="color:#d07711;">&quot;$s&quot; </span><span style="color:#72ab00;">|</span><span> awk </span><span style="color:#72ab00;">-</span><span style="color:#5597d6;">F</span><span>, </span><span style="color:#d07711;">&#39;{print $2}&#39; |
| 37 | +</span><span style="color:#d07711;">&quot;fox |
| 38 | +</span></code></pre> |
| 39 | +<p>For such simple CSV input, <code>FPAT</code> helps to define a field as either a double quoted string or a sequence of non-comma characters.</p> |
| 40 | +<pre data-lang="ruby" style="background-color:#f5f5f5;color:#1f1f1f;" class="language-ruby "><code class="language-ruby" data-lang="ruby"><span style="color:#7f8989;"># * is used instead of + to allow empty fields |
| 41 | +</span><span>$ echo </span><span style="color:#d07711;">&quot;$s&quot; </span><span style="color:#72ab00;">|</span><span> awk </span><span style="color:#72ab00;">-</span><span>v </span><span style="color:#c23f31;">FPAT</span><span style="color:#72ab00;">=</span><span style="color:#d07711;">&#39;&quot;[^&quot;]*&quot;|[^,]*&#39; &#39;{print $2}&#39; |
| 42 | +</span><span style="color:#d07711;">&quot;fox,42&quot; |
| 43 | +</span></code></pre> |
| 44 | +<p><img src="/images/warning.svg" alt="warning" /> The above will not work for all kinds of CSV files, for example if fields contain escaped double quotes, newline characters, etc. See <a href="https://stackoverflow.com/q/45420535/4082052">stackoverflow: What's the most robust way to efficiently parse CSV using awk?</a> for such cases. You could also use other programming languages such as Perl, Python, Ruby, etc., which come with standard CSV parsing libraries or have easy access to third-party solutions. There are also specialized command-line tools such as <a href="https://github.com/BurntSushi/xsv">xsv</a>.</p> |
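The warning above notes that languages such as Python come with standard CSV parsing libraries. As a minimal sketch, assuming Python 3 and its standard `csv` module (the helper name `parse_csv_line` is hypothetical and not part of the original tip), the same sample line could be parsed like this:

```python
import csv


def parse_csv_line(line):
    """Split a single CSV record into a list of fields (hypothetical helper)."""
    # csv.reader accepts any iterable of lines, so wrap the one line in a list
    return next(csv.reader([line]))


# same sample input as the awk examples
fields = parse_csv_line('eagle,"fox,42",bee,frog')
print(fields[1])  # fox,42

# a doubled double quote inside a quoted field is treated as an escaped quote
print(parse_csv_line('a,"say ""hi""",b')[1])  # say "hi"
```

Unlike the `FPAT` approach, which keeps the surrounding double quotes as part of the field, `csv.reader` strips them from the parsed value.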
| 45 | +<p><strong>Video demo</strong>:</p> |
| 46 | +<p align="center"><iframe width="560" height="315" loading="lazy" src="https://www.youtube.com/embed/1ZQni88a99w" title="YouTube video player" frameborder="0" allow="accelerometer; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe></p> |
| 47 | +<br> |
| 48 | +<p><img src="/images/info.svg" alt="info" /> See also my <a href="https://github.com/learnbyexample/learn_gnuawk">GNU awk</a> ebook.</p> |
| 49 | +</content> |
| 50 | + </entry> |
10 | 51 | <entry xml:lang="en">
|
11 | 52 | <title>Python tip 29: negative lookarounds</title>
|
12 | 53 | <published>2023-06-13T00:00:00+00:00</published>
|
|