Commit

Update files
leikareipa committed Mar 7, 2024
1 parent f0589e2 commit 1c0a681
Showing 7 changed files with 214 additions and 9 deletions.
@@ -0,0 +1,17 @@
<post-date date="7 March 2024"/>

# Claude 3's exceptional abilities at obscure languages

Earlier this week, [Anthropic launched Claude 3](https://www.anthropic.com/news/claude-3-family), its next-generation family of LLMs. The models &ndash; Opus, Sonnet, and (soon-to-be-released) Haiku &ndash; have already made waves for their ability to trade blows with the previous state-of-the-art, GPT-4.

In my own testing, I've found Claude 3 to be quite capable and even worthy of hype to some extent. It's not a GPT-4 killer in a general sense, but, for example, [the Opus model matches GPT-4 in common programming tasks](/blog/testing-a-medley-of-local-llms-for-coding/).

## The meat

One standout aspect of the Claude 3 Opus model in particular is that it appears to be exceptionally good at reconstructing representations from uncommon data.

In a recent blog post, [I found that Opus is almost twice as good as GPT-4 and nearly five times as good as GPT-4 Turbo at generating code in an obscure variety of assembly language](/blog/llm-performance-in-retro-assembly-coding/). While it's possible in theory that the difference comes from Anthropic scaling up training data for obsolete assembly and related topics, I lean toward the alternative explanation: Opus is more generally accurate at extrapolating from limited data.

User reports have also begun popping up of Opus handling obscure human languages capably ([for example](https://www.reddit.com/r/singularity/comments/1b8603h/claude_3_opus_is_the_first_language_model_that/)). I can confirm that Opus can hold a conversation in a language that has very few native speakers and in which GPT-4 fails almost completely. The model does make mistakes, but its output is generally reasonable and understandable.

It's not clear whether this standout performance is an emergent ability or the result of a targeted effort by Anthropic to increase the representation of such languages in Claude, but to me it seems possible that the model is simply an unusually strong extrapolator of uncommon representations. I expect we'll see a research paper or two on this at some point.
@@ -0,0 +1,85 @@
<!DOCTYPE html>
<html>
<head>
<meta name="viewport" content="width=device-width">
<meta http-equiv="content-type" content="text/html; charset=UTF-8">

<link rel="stylesheet" href="../+assets/blog.css">
<link rel="stylesheet" href="/assets/font-awesome-5-15-4/css/all.min.css">
<script defer src="/assets/font-awesome-5-15-4/attribution.js"></script>
<script defer src="../+assets/highlight.min.js"></script>
<script defer src="/dokki/distributable/dokki.js"></script>
<script type="module" src="../+assets/blog-post-widgets.js"></script>
<script type="module" src="../+assets/post-date.js"></script>

<style>
.dokki-table.results th.name {
writing-mode: vertical-lr;
font-weight: normal;
}
.dokki-table.results td:first-child {
white-space: pre-line;
min-width: 20rem;
}

.dokki-table.results td,
.dokki-table.results th {
width: 0px !important;
}
.dokki-table.results td.s0,
.dokki-table.results td.s1,
.dokki-table.results td.s2,
.dokki-table.results td.s3,
.dokki-table.results td.s {
text-align: center !important;
vertical-align: middle !important;
max-width: 0.85em;
}
.dokki-table.results td.s0 {
color: var(--dokkiCSS-page-inert-fg-color);
}
.dokki-table.results td.s1 {
background-color: rgba(0, 0, 0, 0.05);
}
.dokki-table.results td.s2 {
background-color: rgba(0, 0, 0, 0.1);
}
.dokki-table.results td.s3 {
background-color: rgba(0, 0, 0, 0.2);
}
</style>
</head>
<body>
<ths-feedback></ths-feedback>


<template id="dokki">
<dokki-document>
<dokki-header>
<template #caption>

Claude 3's exceptional abilities at obscure languages

</template>
<template #widgets>
<blog-post-widgets></blog-post-widgets>
</template>
</dokki-header>
<dokki-topics>

<post-date date="7 March 2024"></post-date>
<dokki-topic title="Claude 3&apos;s exceptional abilities at obscure languages">
<p>Earlier this week, <a href="https://www.anthropic.com/news/claude-3-family">Anthropic launched Claude 3</a>, its next-generation family of LLMs. The models – Opus, Sonnet, and (soon-to-be-released) Haiku – have already made waves for their ability to trade blows with the previous state-of-the-art, GPT-4.</p>
<p>In my own testing, I've found Claude 3 to be quite capable and even worthy of hype to some extent. It's not a GPT-4 killer in a general sense, but, for example, <a href="/blog/testing-a-medley-of-local-llms-for-coding/">the Opus model matches GPT-4 in common programming tasks</a>.</p>
<dokki-subtopic title="The meat">
<p>One standout aspect of the Claude 3 Opus model in particular is that it appears to be exceptionally good at reconstructing representations from uncommon data.</p>
<p>In a recent blog post, <a href="/blog/llm-performance-in-retro-assembly-coding/">I found that Opus is almost twice as good as GPT-4 and nearly five times as good as GPT-4 Turbo at generating code in an obscure variety of assembly language</a>. While it's possible in theory that the difference comes from Anthropic scaling up training data for obsolete assembly and related topics, I lean toward the alternative explanation: Opus is more generally accurate at extrapolating from limited data.</p>
<p>User reports have also begun popping up of Opus handling obscure human languages capably (<a href="https://www.reddit.com/r/singularity/comments/1b8603h/claude_3_opus_is_the_first_language_model_that/">for example</a>). I can confirm that Opus can hold a conversation in a language that has very few native speakers and in which GPT-4 fails almost completely. The model does make mistakes, but its output is generally reasonable and understandable.</p>
<p>It's not clear whether this standout performance is an emergent ability or the result of a targeted effort by Anthropic to increase the representation of such languages in Claude, but to me it seems possible that the model is simply an unusually strong extrapolator of uncommon representations. I expect we'll see a research paper or two on this at some point.</p>
</dokki-subtopic></dokki-topic>

</dokki-topics>
</dokki-document>
</template>
</body>
</html>
@@ -0,0 +1,66 @@
<!DOCTYPE html>
<html>
<head>
<meta name="viewport" content="width=device-width">
<meta http-equiv="content-type" content="text/html; charset=UTF-8">

<link rel="stylesheet" href="../+assets/blog.css">
<link rel="stylesheet" href="/assets/font-awesome-5-15-4/css/all.min.css">
<script defer src="/assets/font-awesome-5-15-4/attribution.js"></script>
<script defer src="../+assets/highlight.min.js"></script>
<script defer src="/dokki/distributable/dokki.js"></script>
<script type="module" src="../+assets/blog-post-widgets.js"></script>
<script type="module" src="../+assets/post-date.js"></script>

<style>
.dokki-table.results th.name {
writing-mode: vertical-lr;
font-weight: normal;
}
.dokki-table.results td:first-child {
white-space: pre-line;
min-width: 20rem;
}

.dokki-table.results td,
.dokki-table.results th {
width: 0px !important;
}
.dokki-table.results td.s0,
.dokki-table.results td.s1,
.dokki-table.results td.s2,
.dokki-table.results td.s3,
.dokki-table.results td.s {
text-align: center !important;
vertical-align: middle !important;
max-width: 0.85em;
}
.dokki-table.results td.s0 {
color: var(--dokkiCSS-page-inert-fg-color);
}
.dokki-table.results td.s1 {
background-color: rgba(0, 0, 0, 0.05);
}
.dokki-table.results td.s2 {
background-color: rgba(0, 0, 0, 0.1);
}
.dokki-table.results td.s3 {
background-color: rgba(0, 0, 0, 0.2);
}
</style>
</head>
<body>
<ths-feedback></ths-feedback>
<template dokki-document>
<section title>
Claude 3's exceptional abilities at obscure languages
</section>
<section widgets>
<blog-post-widgets></blog-post-widgets>
</section>
<section content>
<article src="content.md"></article>
</section>
</template>
</body>
</html>
15 changes: 14 additions & 1 deletion blog/index.html
@@ -143,6 +143,19 @@
</p>
</dokki-topic>

<blog-post-abstract
date="March 7, 2024"
:tags="['ai']"
title="Claude 3's exceptional abilities at obscure languages"
brief="
Earlier this week, Anthropic launched Claude 3, its next-generation family
of LLMs. The models have already made waves for their ability to trade blows
with the previous state-of-the-art, GPT-4. One standout aspect of Claude 3
is that it appears to be exceptionally good at reconstructing representations
from uncommon data.
"
></blog-post-abstract>

<blog-post-abstract
date="March 4, 2024"
:tags="['ai']"
@@ -162,7 +175,7 @@
title="LLM performance in retro assembly coding"
brief="
How well can the various publicly available LLMs assist you in retro
assembly coding? Here's eight of them compared.
assembly coding? Here's nine of them compared.
"
></blog-post-abstract>

19 changes: 15 additions & 4 deletions blog/llm-performance-in-retro-assembly-coding/content.md
@@ -2,7 +2,7 @@

# LLM performance in retro assembly coding

How well can the various publicly available LLMs assist you in retro assembly coding? Here's eight of them compared.
How well can the various publicly available LLMs assist you in retro assembly coding? Here's nine of them compared.

## Results

@@ -16,17 +16,18 @@ Here's how they did:
<thead>
<tr>
<th>Task</th>
<th colspan="8">Score per task</th>
<th colspan="9">Score per task</th>
</tr>
<tr>
<th></th>
<th class="name">Claude 3 Opus</th>
<th class="name">GPT-4</th>
<th class="name">Mistral Large</th>
<th class="name">Phind 70B</th>
<th class="name">Phind-70B</th>
<th class="name">GPT-4 Turbo</th>
<th class="name">Gemini Ultra 1.0</th>
<th class="name">DeepSeek Coder</th>
<th class="name">Claude 3 Sonnet</th>
<th class="name">GPT-3.5</th>
</tr>
</thead>
@@ -41,6 +42,7 @@ Here's how they did:
<td class="s1">25%</td>
<td class="s0">0%</td>
<td class="s0">0%</td>
<td class="s0">0%</td>
</tr>
<tr>
<td>A program that paints the screen blue and prints something in the middle.</td>
@@ -52,6 +54,7 @@ Here's how they did:
<td class="s0">0%</td>
<td class="s0">0%</td>
<td class="s0">0%</td>
<td class="s0">0%</td>
</tr>
<tr>
<td>A program that draws something onto the screen in a VGA graphics mode.</td>
@@ -62,6 +65,7 @@ Here's how they did:
<td class="s0">0%</td>
<td class="s0">0%</td>
<td class="s0">0%</td>
<td class="s1">25%</td>
<td class="s0">0%</td>
</tr>
<tr>
@@ -74,6 +78,7 @@ Here's how they did:
<td class="s1">25%</td>
<td class="s1">25%</td>
<td class="s0">0%</td>
<td class="s0">0%</td>
</tr>
<tr>
<td>A program that reads mouse input and displays information about it on the screen.</td>
@@ -85,6 +90,7 @@ Here's how they did:
<td class="s0">0%</td>
<td class="s0">0%</td>
<td class="s0">0%</td>
<td class="s0">0%</td>
</tr>
<tr>
<td>A program that loads a paletted image and displays it on the screen.</td>
@@ -96,6 +102,7 @@ Here's how they did:
<td class="s0">0%</td>
<td class="s0">0%</td>
<td class="s0">0%</td>
<td class="s0">0%</td>
</tr>
</tbody>
</table>
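The summary table further down is just the arithmetic mean of each model's per-task scores, rounded to the nearest whole percent. As an illustration (a sketch, not part of the original benchmark; the list and names below are my own, using Claude 3 Sonnet's column from the task table above):

```python
# Reproduce one row of the per-model averages from the per-task scores.
sonnet_scores = [0, 0, 25, 0, 0, 0]  # Claude 3 Sonnet's six task scores, in percent

def average_score(scores):
    """Mean of per-task percentages, rounded to the nearest whole percent."""
    return round(sum(scores) / len(scores))

print(average_score(sonnet_scores))  # 4, matching the summary table's 4%
```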
@@ -125,7 +132,7 @@ Below are the average test scores per model across all tasks:
<td>17%</td>
</tr>
<tr>
<td>Phind 70B</td>
<td>Phind-70B</td>
<td>13%</td>
</tr>
<tr>
@@ -140,6 +147,10 @@
<td>DeepSeek Coder</td>
<td>4%</td>
</tr>
<tr>
<td>Claude 3 Sonnet</td>
<td>4%</td>
</tr>
<tr>
<td>GPT-3.5</td>
<td>0%</td>