As you may have seen, I have posted about this before, asking for suggestions on software I can use to convert PDF to docx/odt.

I am a teacher. During my time as a researcher I wrote a lot of documents, and I regularly draw upon them to teach my students. I often have to take the text and modify it or build upon it. A lot of my material is bound up in PDFs. Sometimes I have grant applications to write where a previous draft was stored as a PDF. Converting these PDFs to text has become the bane of my life.

I am forced to use online tools because none of the software I have seems to do the trick. A lot of people keep saying pandoc. Pandoc does not convert PDF to any other format: PDF can only be the output format, not the input.

Is there a magic open source solution that I have missed?

  • observantTrapezium@lemmy.ca
    2 days ago

    I know the pain. While there are definitely solutions that work sometimes, there’s just no “one size fits all” that I’m aware of. PDFs can represent text very differently internally.

    What I did for one project where extracting the text produced a complete mess was to convert the PDF pages to images and then OCR them…
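A rough sketch of that image-then-OCR route, for anyone who wants to try it. It assumes Poppler's `pdftoppm` and Tesseract are installed and on the PATH; these are not necessarily the tools used in the project mentioned above. The function names (`rasterize_cmd`, `ocr_cmd`, `pdf_to_text`) are made up for illustration, and the output is plain text only, so layout, tables and formatting are lost.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def rasterize_cmd(pdf: Path, prefix: Path, dpi: int = 300) -> list[str]:
    """Build the pdftoppm command that renders each page to <prefix>-N.png."""
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf), str(prefix)]


def ocr_cmd(image: Path, lang: str = "eng") -> list[str]:
    """Build the tesseract command that prints the recognized text to stdout."""
    return ["tesseract", str(image), "stdout", "-l", lang]


def pdf_to_text(pdf: Path, dpi: int = 300, lang: str = "eng") -> str:
    """Render every page of a PDF to PNG, OCR each page, and join the text."""
    # Assumption: both command-line tools are installed separately.
    for tool in ("pdftoppm", "tesseract"):
        if shutil.which(tool) is None:
            raise RuntimeError(f"{tool} not found on PATH")

    with tempfile.TemporaryDirectory() as tmp:
        prefix = Path(tmp) / "page"
        subprocess.run(rasterize_cmd(pdf, prefix, dpi), check=True)

        pages = []
        # pdftoppm zero-pads page numbers, so a plain sort keeps page order.
        for image in sorted(Path(tmp).glob("page-*.png")):
            result = subprocess.run(
                ocr_cmd(image, lang), check=True, capture_output=True, text=True
            )
            pages.append(result.stdout)

    return "\n\n".join(pages)
```

Calling `pdf_to_text(Path("draft.pdf"))` (a hypothetical file name) returns the recognized text as one string, which can then be saved or pasted into a word processor. Higher `dpi` values tend to help with small print, at the cost of speed.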

      • observantTrapezium@lemmy.ca
        2 days ago

        Interesting, I’ll keep it in mind next time I have to deal with this problem (hopefully never but who knows).

        A few years ago I was in contact with researchers who were developing an AI tool to parse PDFs (I think they cared about extracting data rather than converting to editable formats). From their material I got the impression that it’s extremely difficult to do right using traditional algorithms.