As you will see, I have posted about this before, asking for suggestions on which software I can use to convert PDF to docx/odt.
I am a teacher. During my time as a researcher I wrote a lot of documents, and I regularly draw upon them to teach my students. I often have to take the text and modify it or build upon it. A lot of my material is bound up in PDFs. Sometimes I have grant applications to write where a previous draft of mine was stored as a PDF. Converting them to text has become the bane of my life.
I am forced to use online tools because none of the software I have seems to do the trick. A lot of people keep suggesting pandoc, but pandoc does not convert PDF to any other format. For pandoc, PDF can only be an output format, not an input.
Is there a magic open-source solution that I have missed?
Renumbering characters during font minimization? I haven’t encountered that; it would break searching and copying.
Anyway, PDFs don’t even say, for example, whether a line of text is left-aligned, centered or justified – they usually store the coordinates of the first character and then the spacing to each subsequent one, unless that is defined by the font.
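To make that concrete, here is a rough sketch of the kind of guessing a converter has to do. It assumes you have already pulled each line's left and right x coordinates out of the PDF with some extraction library (not shown here). The names Line and guess_alignment, the column bounds and the tolerance are all made up for illustration, and real tools are considerably more careful than this.

```python
from dataclasses import dataclass

@dataclass
class Line:
    x0: float  # x coordinate where the first glyph starts
    x1: float  # x coordinate where the last glyph ends

def guess_alignment(lines, left, right, tol=2.0):
    """Guess paragraph alignment from line extents inside a column [left, right].

    Hypothetical heuristic: the PDF itself stores no alignment flag,
    so all we can do is compare coordinates against the column edges.
    """
    if not lines:
        return "unknown"
    # The last line of a justified paragraph is usually short, so ignore it
    # when checking whether lines reach the right edge.
    body = lines[:-1] if len(lines) > 1 else lines
    starts_left = all(abs(l.x0 - left) <= tol for l in lines)
    ends_right_all = all(abs(l.x1 - right) <= tol for l in lines)
    ends_right_body = all(abs(l.x1 - right) <= tol for l in body)
    # Centered lines have (roughly) equal gaps on both sides.
    centered = all(abs((l.x0 - left) - (right - l.x1)) <= tol for l in lines)

    if starts_left and ends_right_body and len(lines) > 1:
        return "justified"
    if centered and not starts_left:
        return "center"
    if starts_left:
        return "left"
    if ends_right_all:
        return "right"
    return "unknown"

# Example: a column from x=72 to x=540 (points), values invented for the demo.
paragraph = [Line(72, 540), Line(72, 540), Line(72, 300)]
print(guess_alignment(paragraph, left=72, right=540))
```

Note that a single line cannot be told apart from left-aligned text this way, which is one reason converted documents so often come out with odd formatting.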
And what if the document contains text boxes, or other Word objects? Well, the text is separate from the underlying rectangle (if there is one) and it’s up to the conversion tool to guess if it’s part of the main text layer.
Sorry, it’s really hard to edit PDFs. You might want to use Inkscape for editing the graphical parts. If you also need to edit paragraphs, I suggest recreating the document by pasting them into Word/LibreOffice, and importing any graphical shapes as SVGs (use Inkscape for the conversion, then you can try Word’s “Graphic > Convert to Shapes” feature).
Really, every program that outputs PDF should treat it as an export process, hopefully making it clearer that “saving as PDF” is visually lossless but structurally lossy and messy.