I have acquired several very large files. Specifically, CSVs of 100+ GB.

I want to search for text in these files faster than manually running grep.

To do this, I need to index the files, right? Would something like Aleph be good for this? It seems like the right tool…

https://github.com/alephdata/aleph

Any other tools for doing this?

  • Jason2357@lemmy.ca · 7 days ago

    Parquet is great, especially if there is some reasonable way of partitioning records - for example, by month or year - in case you only need to search, say, 2024. Parquet only has to I/O the specific columns you are concerned with, and if you can partition the records and subset a fraction of them, operations can be extremely efficient.
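
    A minimal sketch of that idea, assuming pandas and pyarrow are installed; the column names (`date`, `amount`) and the output path are hypothetical stand-ins for your actual data, and a real 100+ GB CSV would be streamed with `read_csv(..., chunksize=...)` rather than loaded at once:

    ```python
    import pandas as pd

    # Tiny stand-in for records from a large CSV (hypothetical columns)
    df = pd.DataFrame({
        "date": pd.to_datetime(["2023-06-01", "2024-01-15", "2024-07-30"]),
        "amount": [10.0, 20.0, 30.0],
    })
    df["year"] = df["date"].dt.year

    # Partition by year: writes one subdirectory per value (year=2023/, year=2024/, ...)
    df.to_parquet("records", partition_cols=["year"], engine="pyarrow")

    # Only the 2024 partition is touched on disk, and only the listed columns are read
    hits = pd.read_parquet(
        "records",
        filters=[("year", "=", 2024)],
        columns=["date", "amount"],
    )
    print(len(hits))
    ```

    The `filters` list prunes whole partitions before any I/O happens, and `columns` limits reads to the relevant column chunks, which is where the big wins over grepping a flat CSV come from.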