Converting thousands of snippets #7410

Closed · teucer opened this issue Jun 26, 2021 · 23 comments

teucer commented Jun 26, 2021

I have large JSON files containing markdown snippets. I want to convert each snippet to LaTeX.

  • I could invoke pandoc in a loop to process them, but that is slow.

  • I could artificially concatenate the snippets into one markdown file and then convert that, but I would need to parse the output back into JSON.

Are there any alternatives? If not, I think this is a valuable feature to have.

alerque (Contributor) commented Jun 26, 2021

I'm sure the fastest way is using the Haskell API, but how slow is it really to invoke pandoc in a loop? I just tried throwing 1,000 markdown snippets at it and got about 100 LaTeX results back per second. I suspect the ergonomics of coding up a Haskell solution will trump any time savings unless you process hundreds of thousands on a regular basis. The Python wrapper, by comparison, is ergonomic to use even relative to a shell loop, but it's also about 3 times slower.

teucer (Author) commented Jun 26, 2021

This is for a user-facing application; 100 snippets per second could be considered slow, imho.

I believe most of the time is spent loading pandoc between iterations (maybe some caching is done?). I explored keeping a single pandoc process in memory with Python's subprocess module, but there is no way to tell pandoc that the end of an input has been reached.

alerque (Contributor) commented Jun 26, 2021

Yes, I'm sure the process overhead is substantial. Something compiled from Haskell that iterates over your JSON and does the conversions using pandoc as a library should be able to sidestep all of that and chew through large inputs pretty fast.
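
For concreteness, a sketch of what that could look like — untested, and the input schema here (a JSON array of markdown strings on stdin) is just an assumption; adapt it to your real files. It uses the aeson and pandoc packages:

```haskell
-- Sketch only: assumes the input is a JSON array of markdown strings,
-- e.g. ["# one", "*two*"]; adapt the decoding to your real schema.
import Data.Aeson (decode, encode)
import qualified Data.ByteString.Lazy as BL
import Data.Text (Text)
import Text.Pandoc

-- Convert one markdown snippet to LaTeX; all snippets share a single
-- process, so pandoc's startup cost is paid only once.
convertSnippet :: Text -> PandocIO Text
convertSnippet snippet = do
  doc <- readMarkdown def { readerExtensions = pandocExtensions } snippet
  writeLaTeX def doc

main :: IO ()
main = do
  input <- BL.getContents
  case decode input :: Maybe [Text] of
    Nothing -> error "expected a JSON array of strings"
    Just snippets -> do
      result <- runIO (mapM convertSnippet snippets)
      latex <- handleError result
      BL.putStr (encode latex)  -- emits a JSON array of LaTeX strings
```

Compiled with GHC, this reads the whole array once and emits the converted array, so the per-snippet cost is just the conversion itself.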

teucer (Author) commented Jun 26, 2021

Using the Haskell API seems to be the right approach. I do not have any experience in Haskell though. Hence the feature request: I believe it would be beneficial to be able to run pandoc in a request/reply setup to avoid the process overhead.

alerque (Contributor) commented Jun 27, 2021

> to be able to run pandoc in a request/reply setup

What do you expect a "request/reply setup" to actually look like?

Running as a daemon listening on a TCP port, with an API you can pass questions to and get answers from? If so, I'd suggest that's actually a cool idea, but I would be opposed to it being included in pandoc itself. Such a thing bolted onto a tool with a specific workflow would be feature creep/bloat, would have a conflicting release cadence, etc.

My suggestion is that such a "feature" should be housed in its own project: an app that uses Pandoc's libraries but also handles the TCP port or socket listening and provides a REST or similar API that maps onto Pandoc's provided API. Such a project could iterate much faster than Pandoc itself, release on its own cadence, and experiment until the ergonomics are right without committing to long-term support just because something shipped in Pandoc. It would also keep Pandoc's already onerous dependency tree and compile times down, make both parts easier to maintain and document, and so on.

teucer (Author) commented Jun 27, 2021

I have some corporate constraints and need to support multiple platforms. In some instances ports are blocked; in others I need to support Windows (sockets would not work). I had a REPL in mind, i.e. a way to tell pandoc:

  1. this is the end of the input
  2. give back the output so far
  3. wait for further input

Maybe there are easier ways to achieve this (a rough sketch of what I mean is below). If not, I guess your idea is the only practical option.
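
To make the protocol concrete, a rough sketch of such a REPL — hypothetical, since pandoc has no such mode; the sentinel line and framing are arbitrary choices, not an established convention. Snippets arrive on stdin separated by a sentinel line, and each converted result is written back terminated by the same sentinel:

```haskell
-- Hypothetical REPL framing: a sentinel line means "this is the end
-- of the input"; the converted output is echoed back, terminated by
-- the same sentinel, and the loop waits for further input.
import qualified Data.Text as T
import qualified Data.Text.IO as TIO
import System.IO (hFlush, isEOF, stdout)
import Text.Pandoc

sentinel :: T.Text
sentinel = "<<<END>>>"  -- arbitrary; any line unlikely to occur in real input

-- Collect lines until the sentinel (or EOF); Nothing means "no more input".
readSnippet :: IO (Maybe T.Text)
readSnippet = go []
  where
    go acc = do
      done <- isEOF
      if done
        then return (if null acc then Nothing else Just (T.unlines (reverse acc)))
        else do
          line <- TIO.getLine
          if line == sentinel
            then return (Just (T.unlines (reverse acc)))
            else go (line : acc)

main :: IO ()
main = loop
  where
    loop = do
      next <- readSnippet
      case next of
        Nothing -> return ()
        Just snippet -> do
          result <- runIO $ do
            doc <- readMarkdown def { readerExtensions = pandocExtensions } snippet
            writeLaTeX def doc
          latex <- handleError result
          TIO.putStrLn latex
          TIO.putStrLn sentinel
          hFlush stdout  -- the client blocks until it sees the sentinel
          loop
```

A client would keep this process alive and frame each exchange with the sentinel, which addresses exactly the "no way to tell pandoc the input ended" gap described above.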

alerque (Contributor) commented Jun 29, 2021

I've never seen an environment that blocks localhost ports, so I still think my suggestion would fit (you don't have to expose a port to other machines, and it doesn't have to be a privileged port).

That being said, a REPL would certainly be another option, but the mechanics of query/response and the design of the API would be largely the same as for a port- or socket-based listener. All three could be provided by the same project. I still hold that none of the three fits well inside this project, but they would make a great standalone project built on the library version.

teucer (Author) commented Jun 29, 2021

Then a REST API running on localhost is probably the better approach.

I imagine it would need to be developed in Haskell. It could then be compiled into an executable and support some basic parameters like host, port, and logging.

The real issue on my side is the lack of Haskell experience...

jgm (Owner) commented Jun 30, 2021

I think such a server would be surprisingly easy to code. If I have time, I'll try to write a simple proof of concept.

mb21 (Collaborator) commented Jun 30, 2021

For the use case of "thousands of snippets", you'd probably want the REST API to accept a batch of conversions in a single POST request; otherwise the networking overhead will be similar to the process overhead.

jgm (Owner) commented Jun 30, 2021

Here's an example to get you started. It may need some customization for your needs:

https://github.com/jgm/pandoc-server

jgm closed this as completed Jun 30, 2021
jgm (Owner) commented Jun 30, 2021

@mb21 that's a good thought; I've added a convert-batch endpoint which allows you to do many conversions at once.
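
For anyone landing here later, a batch request could look roughly like the following (using the http-client package). The port, the endpoint path, and the field names below are guesses based on this thread; the pandoc-server README is the authority on the actual schema.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.Aeson (Value, encode, object, (.=))
import qualified Data.ByteString.Lazy.Char8 as BL8
import Network.HTTP.Client

-- Build one conversion job; the field names are assumptions.
job :: String -> Value
job snippet = object
  [ "text" .= snippet
  , "from" .= ("markdown" :: String)
  , "to"   .= ("latex" :: String)
  ]

main :: IO ()
main = do
  manager <- newManager defaultManagerSettings
  -- Port and path are placeholders; use whatever pandoc-server reports.
  initial <- parseRequest "http://localhost:3030/convert-batch"
  let request = initial
        { method = "POST"
        , requestHeaders = [("Content-Type", "application/json")]
        , requestBody = RequestBodyLBS (encode [job "# one", job "*two*"])
        }
  response <- httpLbs request manager
  BL8.putStrLn (responseBody response)  -- presumably a JSON array of results
```

One round trip then carries the whole batch, which is the point mb21 raised: the HTTP overhead is paid once rather than per snippet.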

alerque (Contributor) commented Jun 30, 2021

Wow, that's actually pretty cool and has the potential to be a game changer for applications that do a lot of small conversions and use other libraries just to stay lightweight.

@fmoralesc We might consider this as a way to access the AST using the new CommonMark parser with source positions to speed up syntax highlighting! (cf. vim-pandoc-syntax #300)

teucer (Author) commented Jun 30, 2021

@jgm That was quick, thank you!

lassik commented Aug 25, 2021

@jgm Could the pandoc command-line tool support a mode where it reads an (uncompressed) tar file with any number of documents and outputs another tar file containing the converted versions of those documents? Tar is a simple and ubiquitous format, and it can also be read from stdin / written to stdout for people who don't want to use temp files.
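
For a sense of scale, a sketch of what I have in mind — unreviewed, built on the tar and pandoc packages rather than on pandoc's CLI, with deliberately simplistic error handling and entry naming:

```haskell
-- Sketch of a tar-speaking batch converter: read a tar archive from
-- stdin, convert every regular file from markdown to LaTeX, and write
-- a tar archive of .tex files to stdout.
import qualified Codec.Archive.Tar as Tar
import qualified Codec.Archive.Tar.Entry as Tar
import qualified Data.ByteString.Lazy as BL
import qualified Data.Text.Encoding as TE
import System.FilePath (replaceExtension)
import Text.Pandoc

convertEntry :: Tar.Entry -> IO Tar.Entry
convertEntry entry =
  case Tar.entryContent entry of
    Tar.NormalFile bytes _size -> do
      result <- runIO $ do
        doc <- readMarkdown def { readerExtensions = pandocExtensions }
                            (TE.decodeUtf8 (BL.toStrict bytes))
        writeLaTeX def doc
      latex <- handleError result
      let newName = replaceExtension (Tar.entryPath entry) "tex"
      case Tar.toTarPath False newName of
        Left err      -> error err
        Right tarPath ->
          return (Tar.fileEntry tarPath (BL.fromStrict (TE.encodeUtf8 latex)))
    _ -> return entry  -- pass directories etc. through untouched

main :: IO ()
main = do
  archive <- BL.getContents
  let entries = Tar.foldEntries (:) [] (error . show) (Tar.read archive)
  converted <- mapM convertEntry entries
  BL.putStr (Tar.write converted)
```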

jgm (Owner) commented Aug 25, 2021

No, I think it's better not to build into pandoc things that can easily be done with shell scripting.

lassik commented Sep 1, 2021

@jgm Performance, not convenience, would be the main draw.

I did some measurements using a collection of about 100 Markdown files, totaling about 3000 lines of text.

The following loop takes 3.5 seconds on my computer:

```sh
for md in *.md; do pandoc --from gfm --to json "$md" >/dev/null; done
```

By contrast, sending the same documents as pre-prepared JSON to the pandoc-server convert-batch endpoint takes less than 0.5 seconds. This shows that pandoc's startup time dominates the runtime of the above shell loop.

The slowdown could be eliminated, without the complexity of starting an HTTP server in the background, if the pandoc tool itself spoke tar.

mb21 (Collaborator) commented Sep 2, 2021

> startup time dominates the runtime of the above shell script.

Yes, that's why jgm kindly put together https://github.com/jgm/pandoc-server

> The slowdown could be eliminated, without the complexity of starting an HTTP server in the background, if the pandoc tool itself spoke tar.

That's just moving the complexity from one place to another, though...

lassik commented Sep 2, 2021

For local use, a pipeline is less complex and more reliable than a web server: no temp files, no sockets, no HTTP. I can write my own pandoc-tar next to pandoc-server, but I imagine others doing batch conversions would find the feature useful as well, and it would be a natural extension of the pipeline calling convention that pandoc already follows for single documents.

jgm (Owner) commented Sep 2, 2021

I think the proposed feature adds too much complexity and confusion.
With pandoc my.tar -t html you wouldn't get HTML output, you'd get a tar archive of HTML files.
It just makes the whole interface less consistent. Feel free to develop a pandoc-tar tool, though.

lassik commented Sep 2, 2021

Fair enough. I'm satisfied with this outcome, and will write a separate pandoc-tar command. Thanks for considering the feature anyway!

lassik commented Sep 4, 2021

There is now a pandoc-tar tool, and I'm successfully using it for real work by calling it as a subprocess from Scheme. The tool is very bare-bones and I'm a Haskell noob, so if others find it useful, contributions are very welcome.

jgm (Owner) commented Sep 4, 2021

Excellent!
