Add Starlight llms.txt plugin #10819

Draft: wants to merge 7 commits into base: main

@delucis
Member

delucis commented Jan 28, 2025

Description (required)

Experimental PR that adds https://delucis.github.io/starlight-llms-txt/ with some configuration to see if the result could be helpful to anyone.

Feedback very welcome from folks using tools like this.

The entrypoint llms.txt can be previewed at https://deploy-preview-10819--astro-docs-2.netlify.app/llms.txt


netlify bot commented Jan 28, 2025

Deploy Preview for astro-docs-2 ready!

Name Link
🔨 Latest commit 15a5f0d
🔍 Latest deploy log https://app.netlify.com/sites/astro-docs-2/deploys/67991620c740a40008767b3d
😎 Deploy Preview https://deploy-preview-10819--astro-docs-2.netlify.app

@ArmandPhilippot
Member

I don't know if the build issue is related, but the error reminds me of withastro/astro#12669 if that helps. We might need to bump Astro to >=5.1.5 (the version where the fix was released). It was related to the Container API, but I saw some other changes in the core package, so maybe this could fix the issue.

@delucis
Member Author

delucis commented Jan 28, 2025

I don't know if the build issue is related, but the error reminds me of withastro/astro#12669 if that helps. We might need to bump Astro to >=5.1.5 (the version where the fix was released). It was related to the Container API, but I saw some other changes in the core package, so maybe this could fix the issue.

Ah thank you very much for the pointer @ArmandPhilippot! I’ll try upgrading.
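
One way to confirm that an installed version satisfies the `>=5.1.5` requirement mentioned above is a small version comparison in the shell. This is a minimal sketch, assuming a `sort` that supports `-V` (GNU coreutils and recent BSD versions). `version_ge` is a hypothetical helper, pre-release suffixes are not handled, and how you obtain the installed version is left to your package manager.

```shell
# Sketch only: version_ge is a hypothetical helper, not part of Astro.
# version_ge A B succeeds (exit status 0) when version A >= version B.
# It relies on `sort -V` (version sort): if B sorts first (or ties), A >= B.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Usage (the installed version here is a placeholder you supply yourself):
#   installed="5.1.4"
#   version_ge "$installed" 5.1.5 || echo "Astro needs to be bumped to >=5.1.5"
```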

@stargazer33

Guys, I tried this text file https://deploy-preview-10819--astro-docs-2.netlify.app/llms-full.txt to create .epub and .fb2 documents readable on my E-Ink device.
Short version: it works.
llms-full.tar.gz

.epub files attached (github.com does not like .epub attachments, therefore archived)

@stargazer33

stargazer33 commented Feb 3, 2025

How to get those .epub files:

  1. Use Ubuntu 22.04.

  2. Install the latest pandoc, see https://pandoc.org/

  3. Check the pandoc version:

pandoc --version
pandoc 3.6.2

  4. Edit llms-full.txt. Delete the first line, which contains:

<SYSTEM>This is the full developer documentation for Astro</SYSTEM>

Save as llms-full.1.txt

  5. On the command line, run:

pandoc --standalone --list-of-figures=false --list-of-tables=false --embed-resources=true --table-of-contents --toc-depth=2 -f gfm -t epub llms-full.1.txt -o llms-full.5.epub

pandoc --standalone --list-of-figures=false --list-of-tables=false --embed-resources=true --table-of-contents --toc-depth=2 -f gfm -t epub3 llms-full.1.txt -o llms-full.epub3.epub

  6. Upload the .epub files to your E-Reader.

  7. Read. Mind the incorrect Table of Contents.

P.S.

Usage of pandoc described here:
https://gist.github.com/caseywatts/3d8150fe04e0d8462cfc4d51b9856d39
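
The steps above could be wrapped in a small shell script. This is only a sketch, assuming pandoc 3.6.x is on the PATH and llms-full.txt has already been downloaded. The helper names `strip_system_line` and `make_epub` are hypothetical, and the pandoc flags are copied unchanged from the commands above.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical; pandoc flags are the ones quoted above.

# Print a file without its first line, but only when that line is the
# <SYSTEM>...</SYSTEM> banner, so any other input passes through untouched.
strip_system_line() {
  sed '1{/^<SYSTEM>.*<\/SYSTEM>$/d;}' "$1"
}

# Convert a GitHub-flavoured Markdown file to EPUB.
# $1 = input .txt, $2 = output .epub, $3 = pandoc writer (epub or epub3)
make_epub() {
  pandoc --standalone --list-of-figures=false --list-of-tables=false \
    --embed-resources=true --table-of-contents --toc-depth=2 \
    -f gfm -t "$3" "$1" -o "$2"
}

# Usage (nothing runs until you call the functions):
#   strip_system_line llms-full.txt > llms-full.1.txt
#   make_epub llms-full.1.txt llms-full.5.epub epub
#   make_epub llms-full.1.txt llms-full.epub3.epub epub3
```

Guarding the deletion with a pattern, rather than always dropping line 1, keeps the script safe to rerun on a file that has already been cleaned.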

@stargazer33

stargazer33 commented Feb 3, 2025

s1
s2
s3
s4

Example of incorrect Table of Contents attached.
(Screenshots from FBReader 2.0.5 on Ubuntu. KOReader v2024.07 on Android shows the same)

When generating these EPUB files I specified:

--toc-depth=2

This means: only the top levels 1-2.
These headers are definitely deeper than levels 1-2!

Please fix.

@stargazer33

a1
a2

Attached - Astro documentation on E-Ink device. Just to demonstrate that it works.

@ArmandPhilippot
Member

Nice!

But, I don't think there is an issue? Looking at your screenshots and by quickly looking at the table of contents (using Calibre) I think all the headings are h1 or h2 (e.g. Routing reference, you only have prerender, partial and getStaticPaths and no deeper headings). Did you have a specific example in mind?

@stargazer33

stargazer33 commented Feb 3, 2025

Did you have a specific example in mind?

Look at the first black-blue screenshot.

317. Build your first Astro Blog

Maybe this should be level 2.
But not the headers after it!

I mean 318, 319, 320 should be under 317, NOT on the same level.

Headers like "Checklist" and "What do I need to get started" should not be at that level.

Level 1 should be the chapters, like:

Components
Pages
Layouts

It looks like the first few pages of the TOC are OK.
But the last pages are definitely messed up.
Not sure about the middle.

@ArmandPhilippot
Member

Oh, I see the confusion now, but technically this is correct.

The table of contents is generated using the <h1> and <h2> present on each page of the website. So, Build your first Astro Blog is indeed a page with <h1>Build your first Astro Blog</h1> and <h2>Checklist</h2>. So this is exactly what the output is: "Build your first Astro Blog" is level 1 and "Checklist" is level 2. Each page is placed end to end.

It does not merge all the pages of the tutorial under a Tutorial heading (for example) because these are individual pages with their own first level heading.
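
To see which headings a `--toc-depth=2` table of contents would pick up from the concatenated file, one could list the level 1 and level 2 headings directly. This is a rough sketch, assuming the file uses ATX-style headings (`# Title`) and that fenced code blocks start with three backticks or three tildes at the start of a line. `list_toc_headings` is a hypothetical helper name.

```shell
# Sketch only: list_toc_headings is a hypothetical helper.
# Prints every level 1 and level 2 ATX heading that is not inside a fenced
# code block, in document order.
list_toc_headings() {
  awk '
    # Build the three-backtick fence marker from character code 96.
    BEGIN { fence = sprintf("%c%c%c", 96, 96, 96) }
    # Toggle the "inside code" state on every fence line.
    index($0, fence) == 1 || index($0, "~~~") == 1 { in_code = !in_code; next }
    # One or two "#" followed by a space is a level 1 or level 2 heading.
    !in_code && /^##? / { print }
  ' "$@"
}

# Usage:
#   list_toc_headings llms-full.1.txt | less
```

Because every page contributes its own level 1 heading, the listing shows tutorial pages at the same level as pages such as Components or Pages, which matches the screenshots.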

@stargazer33

stargazer33 commented Feb 3, 2025

section-titled
One more problem: watch for the "Section titled [section name]" text under each header.
These lines are completely redundant and only distract, so please remove them...

@stargazer33

stargazer33 commented Feb 3, 2025

Oh, I see the confusion now, but technically this is correct.

Imagine an LLM reading all this stuff. After this - the LLM can produce strange responses based on these headers.

Yes, these headers are H1 headers on some pages, but this is just an implementation issue.
Probably you need some additional processing step to adjust H headers when generating one big .txt file...
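
An additional processing step of the kind suggested here could look like the following. This is a sketch and not something the starlight-llms-txt plugin is known to provide. It assumes ATX-style headings and fenced code blocks that start with three backticks or three tildes; `demote_headings` and the file names in the usage comment are hypothetical.

```shell
# Sketch only: demote_headings is a hypothetical helper.
# Adds one "#" to every ATX heading outside fenced code blocks, so a page's
# level 1 heading becomes level 2 and so on. Level 6 headings stay as they are.
demote_headings() {
  awk '
    BEGIN { fence = sprintf("%c%c%c", 96, 96, 96) }   # three backticks
    # Fence lines toggle the code state and are passed through unchanged.
    index($0, fence) == 1 || index($0, "~~~") == 1 { in_code = !in_code; print; next }
    # RLENGTH counts the hashes plus the trailing space, so <= 6 means levels 1-5.
    !in_code && match($0, /^#+ /) && RLENGTH <= 6 { print "#" $0; next }
    { print }
  ' "$@"
}

# Usage: group several pages under one new top-level heading.
#   { printf '# Tutorial\n\n'; demote_headings tutorial-pages.txt; } > grouped.txt
```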

@stargazer33

stargazer33 commented Feb 3, 2025

@ArmandPhilippot @delucis let me know where/how to report the things I see when reading the .epub
I think if it's not OK for a human, it is not OK for an LLM either

@ArmandPhilippot
Member

ArmandPhilippot commented Feb 3, 2025

Yeah, I also noticed the Section titled... text, which adds a bit of noise. However, this might be a bit tricky to remove because the llms.txt feature comes from a plugin for Starlight (used as the framework for the docs) while the Section titled... text comes from the docs directly (this repository). It corresponds to the links in the titles (the anchor icon you see below):
heading-example

Imagine an LLM reading all this stuff. After this - the LLM can produce strange responses based on these headers.

I don't know how LLMs work so I can't comment on that. I was reacting because I thought you were saying there was a bug in the way it parses pages. I don't know if it can be improved, I haven't read the plugin code to generate these files. Maybe Chris will have an idea… But in any case your feedback is appreciated!

@stargazer33

stargazer33 commented Feb 4, 2025

One more update: I am reading this .epub, and when I click on links, nothing happens!
OK, some links are external (www.something...), and the reader software does not follow external links.
BUT internal links do not work either, see the attached screenshot -> I tried to click on "Pages"

It happens in the E-Reader software on PC.
does-not-follow-links

@sarah11918
Member

I'm just adding a comment from the Discord for the record here in this PR: the intention and scope of THIS PR is to generate the llms.txt file for use by LLMs. Making the content available and renderable in other formats such as .epub is a nice side effect of having the full text available, but not the intention or direct concern of this PR.

Knowing about some issues for other use cases may in fact highlight issues relevant to the intended use case, so feedback is certainly valuable! But just pointing out that we don't have to solve those issues in this PR, as generating content for static publication formats like .epub and .pdf isn't the priority here. (As nice as it would be to statically reproduce Astro Docs for other publications, our content changes so rapidly that having a "hard" digital copy isn't a priority for us at the moment.)

@stargazer33

stargazer33 commented Feb 4, 2025

I converted this text file https://deploy-preview-10819--astro-docs-2.netlify.app/llms-small.txt to epub, see attached archive.

It seems part of the Table of Contents is not OK.
In the attached screenshot, look at "Add Integrations" and the long list of chapters that follows, each starting with the @ symbol.
Something went wrong here...

toc

llms-small.txt.epub.tar.gz

@delucis
Member Author

delucis commented Feb 4, 2025

Thank you for the feedback @stargazer33!

Quick notes:

  • Regarding links: I’m not 100% sure how e-readers would expect to do things. Currently these just use the links to pages we have in docs without any manipulation. I guess it would require reworking all links to be only anchor links within the current document for e-readers? But I’m not sure and as mentioned above the aim was not for EPUB to work in this case.

  • Regarding headings: as @ArmandPhilippot noted, these mirror our page structure and I think that may be OK? Ideally I guess it could be nice to group all tutorial pages somehow, but for LLMs that is already accomplished via the dedicated file linked from llms.txt: https://deploy-preview-10819--astro-docs-2.netlify.app/_llms-txt/build-a-blog-tutorial.txt

  • It would definitely be nice to fix the extra “Section titled…” text. That’s coming from the accessible labels for our heading anchors and I’d missed them when taking a look at the output. I’ll have to have a think how best to support removing something like that — currently the HTML => Markdown processing is entirely done by the starlight-llms-txt plugin and doesn’t provide a way for a user (e.g. this repo) to specify additional processing steps.
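
Until the plugin offers such a hook, someone converting the files themselves could filter the labels out after downloading. This is a sketch under a loud assumption: that each label sits on its own line beginning with `Section titled` (optionally wrapped in a Markdown link). The exact output format was not checked here, and `strip_section_labels` is a hypothetical helper.

```shell
# Sketch only: strip_section_labels is a hypothetical helper, and the line
# format it matches is an assumption about the generated text.
# Deletes every line that starts with "Section titled " or "[Section titled ".
# Caveat: ordinary prose that happens to start this way would also be removed.
strip_section_labels() {
  sed -E '/^\[?Section titled .*$/d' "$@"
}

# Usage:
#   strip_section_labels llms-full.1.txt > llms-full.2.txt
```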

@stargazer33

stargazer33 commented Feb 4, 2025

* Regarding links: I’m not 100% sure how e-readers would expect to do things.

Well, just look at attached screenshots.
One screenshot is from llms-full.txt (2.2 MB, complete documentation)
another screenshot from llms-small.txt

l1
l2

You cannot have a link like this /en/basics/astro-pages/ in a self-contained document.
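
One possible workaround on the reader's side is to turn root-relative Markdown links into absolute URLs before running pandoc, so that they at least point somewhere outside the book. This is a sketch, assuming inline links of the form `[text](/path/)`. `absolutize_links` is a hypothetical helper, and the base URL in the usage comment is simply the preview host mentioned in this thread. It does not create in-document navigation, which would need anchors inside the EPUB itself.

```shell
# Sketch only: absolutize_links is a hypothetical helper.
# absolutize_links BASE_URL [FILE...]
# Rewrites inline Markdown links whose target starts with a single "/" so
# that they become BASE_URL + target. Absolute links are left alone.
# Caveat: BASE_URL must not contain "#" or "&", which are special to sed here.
absolutize_links() {
  base=${1%/}   # drop one trailing slash so the joined URL has exactly one
  shift
  sed -E "s#\]\(/([^/)][^)]*)\)#](${base}/\1)#g" "$@"
}

# Usage:
#   absolutize_links https://deploy-preview-10819--astro-docs-2.netlify.app llms-full.1.txt > llms-full.2.txt
```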

@stargazer33

stargazer33 commented Feb 4, 2025

* Regarding headings: as @ArmandPhilippot noted, these mirror our page structure and I think that may be OK?

As a user I find the current table of contents incorrect and unreadable, that's it.
And no, as a user I do not care about implementation details.

And an LLM is also a kind of "user". It just does not complain; it will hallucinate based on incorrect information.

P.S. When, and in which ticket/PR, it should be fixed, I do not know. Here I am in the user role )) I am just writing down what I see ))

@stargazer33

stargazer33 commented Feb 5, 2025

Some of the *.txt listed here https://deploy-preview-10819--astro-docs-2.netlify.app/llms.txt
have invalid formatting.
I checked:
https://deploy-preview-10819--astro-docs-2.netlify.app/_llms-txt/how-to-recipes.txt - formatting broken
https://deploy-preview-10819--astro-docs-2.netlify.app/_llms-txt/build-a-blog-tutorial.txt - formatting broken
https://deploy-preview-10819--astro-docs-2.netlify.app/llms-small.txt - formatting broken

On the other hand
https://deploy-preview-10819--astro-docs-2.netlify.app/llms-full.txt - formatting OK

I used exactly the same pandoc command (shown above) to create .epub files and the same software to view them.
So - the problem is not in the .epub converter, but in the .txt files themselves

![how-to-recipes](https://github.com/user-attachments/assets/5d016b29-7d0c-43a5-a60d-fe8a9b501b77)
build-a-blog
llms-small

llms-full-formatting-ok

@delucis
Member Author

delucis commented Feb 5, 2025

Some of the *.txt listed here deploy-preview-10819--astro-docs-2.netlify.app/llms.txt
have invalid formatting.

Ah yeah, that’s expected. In order to compress these files for LLMs they have whitespace collapsed. Similar to https://svelte.dev/llms-small.txt for example from the Svelte docs. So they’re no longer valid Markdown.
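
To illustrate why collapsed whitespace stops the text from being valid Markdown, the effect can be imitated in the shell. This is only a rough illustration, not the plugin's actual minification: `collapse_whitespace` is a hypothetical helper that squeezes every run of whitespace, including newlines, into a single space.

```shell
# Sketch only: collapse_whitespace is a hypothetical helper and merely
# imitates the effect described above. It replaces each run of whitespace
# (spaces, tabs, newlines) with one space, so the result is a single line.
collapse_whitespace() {
  tr -s '[:space:]' ' '
}

# Usage:
#   printf '# Title\n\nSome text.\n\n## Next\n' | collapse_whitespace
```

After collapsing, heading markers such as `##` no longer begin a line, so a Markdown parser like pandoc reads them as ordinary text. For EPUB conversion, llms-full.txt, which keeps its line structure, is the better input.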

@delucis delucis mentioned this pull request Feb 13, 2025