You will see that I have posted about this before, asking for suggestions on which software I can use to convert PDF to docx/odt.
I am a teacher. During my time as a researcher I wrote a lot of documents, and I regularly draw upon them to teach my students. I often have to take the text and modify it or build upon it. A lot of my material is bound up in PDFs. Sometimes I have grant applications to write where a previous draft of mine was stored as a PDF. Converting them to text has become the bane of my life.
I am forced to use online tools because none of the software I have seems to do the trick. A lot of people keep suggesting pandoc, but pandoc does not convert PDF to any other format. PDF can only be the output format.
Is there a magic open source solution that I have missed?
The problem lies in the PDFs themselves. Inside them are objects that represent lines of glyphs, if you are lucky. A conversion tool can guess which of those lines belong together and produce the text.
It cannot know the intent behind them, though. Take a numbered list. The first line consists of two line objects: the number plus the . or the ), and the first line of text. The conversion tool now has to guess. As the line blocks with the numbers are all to the left of the line blocks with text, this could be a numbered list. Or it could be a table with two columns. Nothing in the PDF gives any hints.
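To make that guessing concrete, here is a minimal sketch in Python of the kind of heuristic a converter might apply. The `Fragment` type, the `guess_numbered_list` function and the coordinates are all made up for illustration; real tools work on whatever their PDF parser hands them and use far more elaborate rules.

```python
import re
from dataclasses import dataclass

@dataclass
class Fragment:
    """Hypothetical stand-in for one 'line of glyphs' object."""
    x: float   # left edge of the text run
    y: float   # baseline position on the page
    text: str

# Looks like "1." or "2)": the only evidence that this might be a list marker
MARKER = re.compile(r"^\d+[.)]$")

def guess_numbered_list(fragments, y_tolerance=1.0):
    """Pair each marker-like fragment with the nearest fragment to its right
    on (roughly) the same baseline."""
    items = []
    for marker in fragments:
        if not MARKER.match(marker.text.strip()):
            continue
        same_row = [
            f for f in fragments
            if f is not marker
            and abs(f.y - marker.y) <= y_tolerance
            and f.x > marker.x
        ]
        if same_row:
            body = min(same_row, key=lambda f: f.x)
            items.append((marker.text.strip(), body.text))
    return items

fragments = [
    Fragment(72, 700, "1."), Fragment(90, 700, "Preserve the original file"),
    Fragment(72, 686, "2."), Fragment(90, 686, "Export to PDF last"),
]
items = guess_numbered_list(fragments)
```

Note that a two-column table whose first column happens to contain "1." and "2." produces exactly the same fragments, so the function would call it a list as well. That is the ambiguity described above.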
And that is the easy part. This assumes that the document either uses default fonts or keeps its embedded fonts untouched. If it uses embedded fonts and a PDF optimizer that embeds only the characters actually used and renumbers them, any copy or conversion tool is bound to fail.
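A rough Python sketch of why renumbered glyphs hurt: the page content only stores character codes, and a tool needs some mapping back to real characters. PDFs can carry such a mapping (a ToUnicode table), but when it is missing or wrong the codes are meaningless. The `decode_codes` function and the tiny three-glyph font here are illustrative assumptions, not how any particular library works.

```python
def decode_codes(codes, to_unicode=None):
    """Turn character codes from a page into text.

    to_unicode is a plain dict (code -> character), a simplified stand-in
    for a ToUnicode table. Without it, the only option left in this sketch
    is to treat the codes as Latin-1 bytes.
    """
    if to_unicode is None:
        return bytes(codes).decode("latin-1")
    # Codes missing from the table become the Unicode replacement character
    return "".join(to_unicode.get(code, "\ufffd") for code in codes)

# A subset font that only contains the glyphs for "PDF", renumbered 1..3
subset_map = {1: "P", 2: "D", 3: "F"}
codes = [1, 2, 3]

with_map = decode_codes(codes, subset_map)   # readable text
without_map = decode_codes(codes)            # control characters, not "PDF"
```

With the table, the codes come back as "PDF". Without it, the same codes decode to three control characters, which is the kind of garbage you get when copying from such a file.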
Same with protected PDFs where you simply cannot copy the text from the start.
And then there are PDFs that just consist of scanned pages. Here you would need OCR software to get something readable out of them.
PDF is an archival output format, the end of a process. It is not something to work from.
Always preserve the original file. Keep it safe. If you change tools, make sure you have a conversion path into something editable. The PDF is for giving away, nothing else.
Renumbering characters during font minimization? I haven’t encountered that; it would break searching and copying.
Anyway, PDFs don’t even say whether a line of text is left-aligned, centered or justified, for example. They usually store the coordinates of the first character and then the spacing to each subsequent one, unless it is defined by the font.
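Since alignment is not stored, a converter has to infer it from where each line starts and ends. A minimal sketch in Python, assuming you already have the horizontal extent of every line in a paragraph (the `guess_alignment` function, its tolerance and the sample coordinates are all hypothetical):

```python
def guess_alignment(lines, tolerance=1.0):
    """Guess paragraph alignment from line extents.

    lines: list of (x_start, x_end) tuples, one per text line.
    """
    def spread(values):
        return max(values) - min(values)

    lefts = [start for start, _ in lines]
    rights = [end for _, end in lines]
    centers = [(start + end) / 2 for start, end in lines]

    left_ok = spread(lefts) <= tolerance
    right_ok = spread(rights) <= tolerance

    if left_ok and right_ok:
        return "justified"
    if left_ok:
        return "left"
    if right_ok:
        return "right"
    if spread(centers) <= tolerance:
        return "centered"
    return "unknown"

ragged_right = guess_alignment([(72, 500), (72, 430), (72, 480)])
both_edges = guess_alignment([(72, 500), (72, 500.4)])
mirrored = guess_alignment([(100, 300), (150, 250), (120, 280)])
```

Even this simple version shows the weak spots: a single-line paragraph looks "justified", and a really justified paragraph with a short last line looks "left". Real converters need extra rules for such cases and still get them wrong sometimes.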
And what if the document contains text boxes, or other Word objects? Well, the text is separate from the underlying rectangle (if there is one) and it’s up to the conversion tool to guess if it’s part of the main text layer.
Sorry, it’s really hard to edit PDFs. You might want to use Inkscape for editing the graphical parts. If you also need to edit paragraphs, I suggest recreating the document by pasting them into Word/LibreOffice, and importing any graphical shapes as SVGs (use Inkscape for the conversion, then you can try Word’s “Graphic > Convert to Shapes” feature).
Really, every piece of software that outputs PDF should treat it as an export process, hopefully making it clearer that “saving as PDF” is visually lossless but structurally lossy and messy.
The only real solution is to always keep your source files. PDFs are not intended to be edited.
There is no solution. It is impossible to convert from PDF to any editable format correctly. The exception is a “hybrid PDF” that has an embedded editable document. If you need to edit PDFs that you created yourself, store them in hybrid format.
I know the pain. While there are definitely solutions that work sometimes, there’s just no “one size fits all” that I’m aware of. PDFs can represent text very differently internally.
What I did for one project where extracting the text produced a complete mess was to convert the PDF pages to images and then OCR them…
StirlingPDF is basically one size fits all.
Interesting, I’ll keep it in mind next time I have to deal with this problem (hopefully never but who knows).
A few years ago I was in contact with researchers who were developing an AI tool to parse PDFs (I think they didn’t care about converting to editable formats, only about extracting data). From their material I got the impression that it’s extremely difficult to do right using traditional algorithms.
StirlingPDF does this. I’ll dm you the one I host for my writing group.
I haven’t tested that part of it yet, but the self-hostable StirlingPDF offers conversion from PDF to a number of formats.
The rest I use it for works fine, so maybe that could be an option.
Maybe LibreOffice Draw can help you out? It has PDF editing capabilities.
If you ever need to edit a PDF that way, just use Inkscape. It is way better than LO draw for that.
https://pdf2docx.readthedocs.io/ seems to fit the bill. I can’t vouch for it.
PDF is such a curse. I say this as a person currently tasked with deploying new mysteriously complex enterprise PDF conversion software for technical documents. The rabbit hole is so deep.
It is not a curse. It does exactly what it is intended to do: create an archive of a document that is universally reproducible.
It is a very well designed cul-de-sac for exactly this purpose. Using it for anything else is calling for trouble.
It’s a curse because it’s used for things other than what it’s intended to. It’s doing a good job representing printed material, but unfortunately people very commonly expect it to be something more akin to a word processor file.
This is probably my first time ever using it for an appropriate purpose as this team’s technical docs are destined for the press (and digital distribution). They just have no idea how to software, so I was brought in to build bridges between and ultimately simplify all their tools.
Speaking as a dev, the reason PDF is so strange is that it’s a compound format. It can be just images strung together. It can also be pure text with fonts, etc.
If you open the file as a text file, you can see this. It’s many different formats in a trenchcoat.
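If you want to peek without scrolling through a text editor, a few lines of Python can count some of the tell-tale markers in the raw bytes. This is only a rough sketch: the `count_markers` function and the marker list are my own choices, the sample bytes are a made-up fragment rather than a valid PDF, and many real files compress their object dictionaries, so a byte scan like this can undercount.

```python
import re

# Byte patterns that hint at what kind of content a PDF carries
MARKERS = {
    "header": rb"%PDF-\d\.\d",
    "fonts": rb"/Type\s*/Font\b",
    "images": rb"/Subtype\s*/Image\b",
    "jpeg_streams": rb"/DCTDecode\b",
    "deflate_streams": rb"/FlateDecode\b",
}

def count_markers(data: bytes) -> dict:
    """Count how often each marker pattern occurs in the raw file bytes."""
    return {
        name: len(re.findall(pattern, data))
        for name, pattern in MARKERS.items()
    }

# Made-up fragment for illustration; "..." stands in for the image data
sample = (
    b"%PDF-1.4\n"
    b"1 0 obj << /Type /Font /Subtype /Type1 /BaseFont /Helvetica >> endobj\n"
    b"2 0 obj << /Type /XObject /Subtype /Image /Filter /DCTDecode >> stream\n"
    b"...\nendstream endobj\n"
)
counts = count_markers(sample)
```

For a real file you would read it in binary mode and pass the bytes in. A scan-only PDF tends to show images and no fonts, which is a quick hint that you will need OCR.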
Yeah, also a dev here. I’d be so happy if they’d parted ways with the 90s legacy bits at some point. Just glad there are enough parsing libraries that I’ll never need to care (right? Please tell me I’m right!).
I hope you’re right too lol.
Would an alternative be to simply edit the pdfs?
The German software FlexiPDF still lets you buy a yearly version for a one-off sum, and it offers a free trial with a watermark so you can check whether it works well enough for you before you buy.
https://www.softmaker.com/en/products/flexipdf