I have acquired several very large files. Specifically, CSVs of 100+ GB.
I want to search for text in these files faster than manually running grep.
To do this, I need to index the files, right? Would something like Aleph be good for this? It seems like the right tool…
https://github.com/alephdata/aleph
Any other tools for doing this?
Parquet is great, especially if there is some reasonable way of partitioning the records - for example, by month or year - in case you ever only need to search, say, 2024. Parquet lets you do I/O only on the specific columns you care about, and if you can partition the records and scan just a fraction of the partitions, queries can be extremely efficient.
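As a rough sketch of what that looks like in practice, here's a one-time CSV→partitioned-Parquet conversion plus a partition-pruned query using DuckDB, which streams the CSV so it doesn't need to fit in memory. The column names `ts` and `message` are placeholders for whatever your CSV actually contains:

```python
import duckdb

con = duckdb.connect()

# One-time conversion: read the giant CSV and write Hive-partitioned
# Parquet, split by year and month derived from a timestamp column.
con.execute("""
    COPY (
        SELECT *,
               year(ts)  AS year,
               month(ts) AS month
        FROM read_csv_auto('big_file.csv')
    ) TO 'dataset' (FORMAT PARQUET, PARTITION_BY (year, month))
""")

# Later: a search that only touches the 2024 partitions and only
# reads the columns referenced in the query.
rows = con.execute("""
    SELECT *
    FROM read_parquet('dataset/*/*/*.parquet', hive_partitioning = true)
    WHERE year = 2024
      AND message LIKE '%search term%'
""").fetchall()
```

Because of the `year = 2024` predicate and Hive partitioning, DuckDB skips every file outside the 2024 directories entirely, which is where most of the speedup over repeated grep runs comes from.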