Commit

Update files
leikareipa committed Mar 7, 2024
1 parent f0589e2 commit 1c0a681
Showing 7 changed files with 214 additions and 9 deletions.
@@ -0,0 +1,17 @@
<post-date date="7 March 2024"/>

# Claude 3's exceptional abilities at obscure languages

Earlier this week, [Anthropic launched Claude 3](https://www.anthropic.com/news/claude-3-family), its next-generation family of LLMs. The models &ndash; Opus, Sonnet, and (soon-to-be-released) Haiku &ndash; have already made waves for their ability to trade blows with the previous state-of-the-art, GPT-4.

In my own testing, I've found Claude 3 to be quite capable and even worthy of hype to some extent. It's not a GPT-4 killer in a general sense, but, for example, [the Opus model matches GPT-4 in common programming tasks](/blog/testing-a-medley-of-local-llms-for-coding/).

## The meat

One standout aspect of the Claude 3 Opus model in particular is that it appears to be exceptionally good at reconstructing representations from uncommon data.

In a recent blog post, [I found that Opus is almost twice as good as GPT-4 and nearly five times as good as GPT-4 Turbo at generating code in an obscure variety of assembly language](/blog/llm-performance-in-retro-assembly-coding/). While it's possible in theory that the difference comes from Anthropic scaling up training data for obsolete assembly and related topics, I lean toward the alternative explanation: Opus is more generally accurate at extrapolating from limited data.

User reports have also begun popping up of Opus handling obscure human languages capably ([for example](https://www.reddit.com/r/singularity/comments/1b8603h/claude_3_opus_is_the_first_language_model_that/)). I can confirm that Opus can hold a conversation in a language that has very few native speakers and in which GPT-4 fails almost completely. The model does make mistakes, but its output is generally reasonable and understandable.

It's not clear whether this standout performance is an emergent ability or the result of a targeted effort by Anthropic to increase the representation of such languages in Claude, but to me it seems possible that the model is simply an unusually strong extrapolator of uncommon representations. I expect we'll see a research paper or two on this at some point.
@@ -0,0 +1,85 @@
<!DOCTYPE html>
<html>
<head>
<meta name="viewport" content="width=device-width">
<meta http-equiv="content-type" content="text/html; charset=UTF-8">

<link rel="stylesheet" href="../+assets/blog.css">
<link rel="stylesheet" href="/assets/font-awesome-5-15-4/css/all.min.css">
<script defer src="/assets/font-awesome-5-15-4/attribution.js"></script>
<script defer src="../+assets/highlight.min.js"></script>
<script defer src="/dokki/distributable/dokki.js"></script>
<script type="module" src="../+assets/blog-post-widgets.js"></script>
<script type="module" src="../+assets/post-date.js"></script>

<style>
.dokki-table.results th.name {
writing-mode: vertical-lr;
font-weight: normal;
}
.dokki-table.results td:first-child {
white-space: pre-line;
min-width: 20rem;
}

.dokki-table.results td,
.dokki-table.results th {
width: 0px !important;
}
.dokki-table.results td.s0,
.dokki-table.results td.s1,
.dokki-table.results td.s2,
.dokki-table.results td.s3,
.dokki-table.results td.s {
text-align: center !important;
vertical-align: middle !important;
max-width: 0.85em;
}
.dokki-table.results td.s0 {
color: var(--dokkiCSS-page-inert-fg-color);
}
.dokki-table.results td.s1 {
background-color: rgba(0, 0, 0, 0.05);
}
.dokki-table.results td.s2 {
background-color: rgba(0, 0, 0, 0.1);
}
.dokki-table.results td.s3 {
background-color: rgba(0, 0, 0, 0.2);
}
</style>
</head>
<body>
<ths-feedback></ths-feedback>


<template id="dokki">
<dokki-document>
<dokki-header>
<template #caption>

Claude 3's exceptional abilities at obscure languages

</template>
<template #widgets>
<blog-post-widgets></blog-post-widgets>
</template>
</dokki-header>
<dokki-topics>

<post-date date="7 March 2024"></post-date>
<dokki-topic title="Claude 3&apos;s exceptional abilities at obscure languages">
<p>Earlier this week, <a href="https://www.anthropic.com/news/claude-3-family">Anthropic launched Claude 3</a>, its next-generation family of LLMs. The models – Opus, Sonnet, and (soon-to-be-released) Haiku – have already made waves for their ability to trade blows with the previous state-of-the-art, GPT-4.</p>
<p>In my own testing, I've found Claude 3 to be quite capable and even worthy of hype to some extent. It's not a GPT-4 killer in a general sense, but, for example, <a href="/blog/testing-a-medley-of-local-llms-for-coding/">the Opus model matches GPT-4 in common programming tasks</a>.</p>
<dokki-subtopic title="The meat">
<p>One standout aspect of the Claude 3 Opus model in particular is that it appears to be exceptionally good at reconstructing representations from uncommon data.</p>
<p>In a recent blog post, <a href="/blog/llm-performance-in-retro-assembly-coding/">I found that Opus is almost twice as good as GPT-4 and nearly five times as good as GPT-4 Turbo at generating code in an obscure variety of assembly language</a>. While it's possible in theory that the difference comes from Anthropic scaling up training data for obsolete assembly and related topics, I lean toward the alternative explanation: Opus is more generally accurate at extrapolating from limited data.</p>
<p>User reports have also begun popping up of Opus handling obscure human languages capably (<a href="https://www.reddit.com/r/singularity/comments/1b8603h/claude_3_opus_is_the_first_language_model_that/">for example</a>). I can confirm that Opus can hold a conversation in a language that has very few native speakers and in which GPT-4 fails almost completely. The model does make mistakes, but its output is generally reasonable and understandable.</p>
<p>It's not clear whether this standout performance is an emergent ability or the result of a targeted effort by Anthropic to increase the representation of such languages in Claude, but to me it seems possible that the model is simply an unusually strong extrapolator of uncommon representations. I expect we'll see a research paper or two on this at some point.</p>
</dokki-subtopic></dokki-topic>

</dokki-topics>
</dokki-document>
</template>
</body>
</html>
@@ -0,0 +1,66 @@
<!DOCTYPE html>
<html>
<head>
<meta name="viewport" content="width=device-width">
<meta http-equiv="content-type" content="text/html; charset=UTF-8">

<link rel="stylesheet" href="../+assets/blog.css">
<link rel="stylesheet" href="/assets/font-awesome-5-15-4/css/all.min.css">
<script defer src="/assets/font-awesome-5-15-4/attribution.js"></script>
<script defer src="../+assets/highlight.min.js"></script>
<script defer src="/dokki/distributable/dokki.js"></script>
<script type="module" src="../+assets/blog-post-widgets.js"></script>
<script type="module" src="../+assets/post-date.js"></script>

<style>
.dokki-table.results th.name {
writing-mode: vertical-lr;
font-weight: normal;
}
.dokki-table.results td:first-child {
white-space: pre-line;
min-width: 20rem;
}

.dokki-table.results td,
.dokki-table.results th {
width: 0px !important;
}
.dokki-table.results td.s0,
.dokki-table.results td.s1,
.dokki-table.results td.s2,
.dokki-table.results td.s3,
.dokki-table.results td.s {
text-align: center !important;
vertical-align: middle !important;
max-width: 0.85em;
}
.dokki-table.results td.s0 {
color: var(--dokkiCSS-page-inert-fg-color);
}
.dokki-table.results td.s1 {
background-color: rgba(0, 0, 0, 0.05);
}
.dokki-table.results td.s2 {
background-color: rgba(0, 0, 0, 0.1);
}
.dokki-table.results td.s3 {
background-color: rgba(0, 0, 0, 0.2);
}
</style>
</head>
<body>
<ths-feedback></ths-feedback>
<template dokki-document>
<section title>
Claude 3's exceptional abilities at obscure languages
</section>
<section widgets>
<blog-post-widgets></blog-post-widgets>
</section>
<section content>
<article src="content.md"></article>
</section>
</template>
</body>
</html>
15 changes: 14 additions & 1 deletion blog/index.html
@@ -143,6 +143,19 @@
</p>
</dokki-topic>

<blog-post-abstract
date="March 7, 2024"
:tags="['ai']"
title="Claude 3's exceptional abilities at obscure languages"
brief="
Earlier this week, Anthropic launched Claude 3, its next-generation family
of LLMs. The models have already made waves for their ability to trade blows
with the previous state-of-the-art, GPT-4. One standout aspect of Claude 3
is that it appears to be exceptionally good at reconstructing representations
from uncommon data.
"
></blog-post-abstract>

<blog-post-abstract
date="March 4, 2024"
:tags="['ai']"
@@ -162,7 +175,7 @@
title="LLM performance in retro assembly coding"
brief="
How well can the various publicly available LLMs assist you in retro
assembly coding? Here's eight of them compared.
assembly coding? Here's nine of them compared.
"
></blog-post-abstract>

19 changes: 15 additions & 4 deletions blog/llm-performance-in-retro-assembly-coding/content.md
@@ -2,7 +2,7 @@

# LLM performance in retro assembly coding

How well can the various publicly available LLMs assist you in retro assembly coding? Here's eight of them compared.
How well can the various publicly available LLMs assist you in retro assembly coding? Here's nine of them compared.

## Results

@@ -16,17 +16,18 @@ Here's how they did:
<thead>
<tr>
<th>Task</th>
<th colspan="8">Score per task</th>
<th colspan="9">Score per task</th>
</tr>
<tr>
<th></th>
<th class="name">Claude 3 Opus</th>
<th class="name">GPT-4</th>
<th class="name">Mistral Large</th>
<th class="name">Phind 70B</th>
<th class="name">Phind-70B</th>
<th class="name">GPT-4 Turbo</th>
<th class="name">Gemini Ultra 1.0</th>
<th class="name">DeepSeek Coder</th>
<th class="name">Claude 3 Sonnet</th>
<th class="name">GPT-3.5</th>
</tr>
</thead>
@@ -41,6 +42,7 @@ Here's how they did:
<td class="s1">25%</td>
<td class="s0">0%</td>
<td class="s0">0%</td>
<td class="s0">0%</td>
</tr>
<tr>
<td>A program that paints the screen blue and prints something in the middle.</td>
@@ -52,6 +54,7 @@ Here's how they did:
<td class="s0">0%</td>
<td class="s0">0%</td>
<td class="s0">0%</td>
<td class="s0">0%</td>
</tr>
<tr>
<td>A program that draws something onto the screen in a VGA graphics mode.</td>
@@ -62,6 +65,7 @@ Here's how they did:
<td class="s0">0%</td>
<td class="s0">0%</td>
<td class="s0">0%</td>
<td class="s1">25%</td>
<td class="s0">0%</td>
</tr>
<tr>
@@ -74,6 +78,7 @@ Here's how they did:
<td class="s1">25%</td>
<td class="s1">25%</td>
<td class="s0">0%</td>
<td class="s0">0%</td>
</tr>
<tr>
<td>A program that reads mouse input and displays information about it on the screen.</td>
@@ -85,6 +90,7 @@ Here's how they did:
<td class="s0">0%</td>
<td class="s0">0%</td>
<td class="s0">0%</td>
<td class="s0">0%</td>
</tr>
<tr>
<td>A program that loads a paletted image and displays it on the screen.</td>
@@ -96,6 +102,7 @@ Here's how they did:
<td class="s0">0%</td>
<td class="s0">0%</td>
<td class="s0">0%</td>
<td class="s0">0%</td>
</tr>
</tbody>
</table>
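The summary table further down is just the arithmetic mean of each model's per-task scores, rounded to the nearest whole percent. As an illustration (a sketch, not part of the original benchmark; the list and names below are my own, using Claude 3 Sonnet's column from the task table above):

```python
# Reproduce one row of the per-model averages from the per-task scores.
sonnet_scores = [0, 0, 25, 0, 0, 0]  # Claude 3 Sonnet's six task scores, in percent

def average_score(scores):
    """Mean of per-task percentages, rounded to the nearest whole percent."""
    return round(sum(scores) / len(scores))

print(average_score(sonnet_scores))  # 4, matching the summary table's 4%
```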
@@ -125,7 +132,7 @@ Below are the average test scores per model across all tasks:
<td>17%</td>
</tr>
<tr>
<td>Phind 70B</td>
<td>Phind-70B</td>
<td>13%</td>
</tr>
<tr>
@@ -140,6 +147,10 @@
<td>DeepSeek Coder</td>
<td>4%</td>
</tr>
<tr>
<td>Claude 3 Sonnet</td>
<td>4%</td>
</tr>
<tr>
<td>GPT-3.5</td>
<td>0%</td>