I have acquired several very large files. Specifically, CSVs of 100+ GB.

I want to search for text in these files faster than manually running grep.

To do this, I need to index the files, right? Would something like Aleph be good for this? It seems like the right tool…

https://github.com/alephdata/aleph

Any other tools for doing this?

  • wise_pancake@lemmy.ca · 6 days ago

    If possible, convert those files to compressed Parquet, and apply sorting and partitioning to them.

    I’ve gotten 10–100 GB CSV files down to 300 MB–5 GB just by doing that.

    That makes searching and scanning so much faster, and you can do it all with free, open-source software like Polars and Ibis.
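
    For example, a minimal conversion sketch with Polars (assuming a recent Polars release; the file names, sort key, and compression choice are placeholders, and sorting a file this big may spill to disk):

    ```python
    import polars as pl

    # Stream the CSV so it never has to fit in RAM, sort it, and write compressed Parquet.
    (
        pl.scan_csv("huge_file.csv")
          .sort("timestamp")  # placeholder sort key
          .sink_parquet("huge_file.parquet", compression="zstd")
    )
    ```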

    • Jason2357@lemmy.ca · 6 days ago

      Parquet is great, especially if there is some reasonable way of partitioning the records, for example by month or year, so that you can search only 2024 or whatever subset you need. It also only has to I/O the specific columns you are concerned with, so if you can partition the records and scan just a fraction of them, operations can be extremely efficient.
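
      A lazy Polars query shows why that pays off; this is only a sketch, with the dataset layout and column names invented for illustration:

      ```python
      import polars as pl

      # Hive-style partitioned dataset, e.g. data/year=2023/..., data/year=2024/...
      lf = pl.scan_parquet("data/**/*.parquet", hive_partitioning=True)

      result = (
          lf.filter(pl.col("year") == 2024)      # partition pruning: only the 2024 files are read
            .select(["timestamp", "message"])    # column projection: only these columns hit the disk
            .filter(pl.col("message").str.contains("error"))
            .collect()
      )
      ```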

  • TiTeY`@jlai.lu · 7 days ago

    If the CSV entries are similar, you can try OpenSearch or Elasticsearch. They’re great for plain-text search (both are built on Lucene).
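
    If you go that route, the Python bulk helper keeps memory flat while loading the rows. A rough sketch, assuming a local node and a CSV with a header row; the index and file names are placeholders, and OpenSearch ships a near-identical opensearch-py client:

    ```python
    import csv
    from elasticsearch import Elasticsearch, helpers

    es = Elasticsearch("http://localhost:9200")

    def rows(path, index):
        # Yield one document per CSV row so the whole file is never held in memory.
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                yield {"_index": index, "_source": row}

    helpers.bulk(es, rows("huge_file.csv", "my-csv-index"))
    ```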

  • irotsoma@lemmy.blahaj.zone · 7 days ago

    I’ve used Java Scanner objects to do this extremely efficiently, with minimal memory required, even with multiple parallel searches. Indexing is only necessary if you want to search the data many times and don’t know exactly what the searches will be. For one-time searches it’s not going to be useful; grep is honestly going to be faster and more efficient for most of those.

    The initial indexing or searching of the files will be bottlenecked by the speed of the disk they’re on, no matter what you do. Indexing only helps because it moves future searches into faster memory.

    So it greatly depends on what you need to search for and how often. The tradeoff is memory usage, and it only pays off across multiple searches of the data you chose to index from the files in the first pass.
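
    The same streaming idea in Python rather than Java, just to make the constant-memory point concrete (the file name and search term are placeholders):

    ```python
    def stream_search(path, needle):
        # Read one line at a time; memory use stays flat regardless of file size.
        with open(path, errors="replace") as f:
            for line_no, line in enumerate(f, start=1):
                if needle in line:
                    yield line_no, line.rstrip("\n")

    for line_no, line in stream_search("huge_file.csv", "some value"):
        print(line_no, line)
    ```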

  • yaroto98@lemmy.org · 7 days ago

    I’ve done this with massive log files using Perl and regex. That’s basically what the language was built for.

    But with CSVs? I’d throw them in a db with an index.

    • SheeEttin@lemmy.zip · 6 days ago

      Agreed. If the data is structured suitably, there are plenty of tools to slurp a CSV into MariaDB or whatever.
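
      The load-and-index pattern looks roughly like this; SQLite (from the standard library) stands in for MariaDB/Postgres here, and the table and column names are invented for the example:

      ```python
      import csv
      import sqlite3

      conn = sqlite3.connect("huge.db")
      conn.execute("CREATE TABLE IF NOT EXISTS records (id TEXT, name TEXT, value TEXT)")

      with open("huge_file.csv", newline="") as f:
          reader = csv.reader(f)
          next(reader)  # skip the header row
          conn.executemany("INSERT INTO records VALUES (?, ?, ?)", reader)

      # Index the column you plan to search on so lookups avoid full-table scans.
      conn.execute("CREATE INDEX IF NOT EXISTS idx_records_name ON records(name)")
      conn.commit()
      ```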

  • jonne@infosec.pub · 7 days ago

    Really depends on what the data is and whether you want to search it regularly or just as a one-time thing.

    You could load them into an RDBMS (MySQL/Postgres) and have it handle the indexing, or use Python tools to process the files.

    If it’s just a one-time thing, grep is probably fine tho.

  • tal@lemmy.today · 7 days ago

    Are you looking for specific values in some field in this table, or substrings in that field?

    If specific values, I’d probably import the CSV file into a database and index the column you care about.