🦞🌯 Lobster Roll

Thread

Fast subsets of large datasets with Pandas and SQLite (pythonspeed.com)

Stories related to "Fast subsets of large datasets with Pandas and SQLite" across the full archive.
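The technique in the headline story — keeping a large dataset indexed in SQLite and pulling only the rows you need into a DataFrame, rather than loading everything into memory — can be sketched roughly like this (table and column names are illustrative, not from the article):

```python
import sqlite3

import pandas as pd

# In-memory database as a stand-in for a large on-disk dataset.
conn = sqlite3.connect(":memory:")
df = pd.DataFrame({"user_id": range(1000), "score": [i % 7 for i in range(1000)]})
df.to_sql("events", conn, index=False)

# Index the filter column so subset queries don't scan the whole table.
conn.execute("CREATE INDEX idx_score ON events(score)")

# Load only the matching subset into Pandas, not the full dataset.
subset = pd.read_sql_query(
    "SELECT * FROM events WHERE score = ?", conn, params=(3,)
)
print(len(subset))
```

The payoff is that memory use and load time scale with the subset you query, not the full dataset on disk.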

Fast subsets of large datasets with Pandas and SQLite (pythonspeed.com)
Fast Python: High performance techniques for large datasets (manning.com)
Rowboat – A fast tool for understanding large datasets (rowboat.xyz)
Fasttfidf: High-performance TF-IDF vectorization for large-scale text datasets (github.com)
Druid | Open-source infrastructure for Real-time Exploratory Analytics on Large datasets (druid.io)
Dat – A git-like tool for large datasets (dat-data.com)
Additional information: https://github.com/maxogden/dat/blob/master/what-is-dat.md
Git repo: https://github.com/maxogden/dat
Example usage commands: https://github.com/maxogden/dat/blob/master/usage.md
Technical notes / supported formats: https://github.com/maxogden/dat/blob/maste...
(In)Security of Embedded Devices' Firmware - Fast and Furious at Large Scale (media.ccc.de)
> [..] In this talk, we present several methods that make *the large scale security analyses of embedded devices* a feasible task. We implemented those techniques in a scalable framework that we tested on real world data. First, we collected a large number of firmware images from Internet reposi...
The problem of parsing large datasets (haskell-works.github.io)
50 times faster data loading for Pandas: no problem, using C++ (blog.esciencecenter.nl)
Make Python Pandas go fast (blog.wallaroolabs.com)
ELSA: Efficient Long-Term Secure Storage of Large Datasets (arxiv.org)
Abstract: "An increasing amount of information today is generated, exchanged, and stored digitally. This also includes long-lived and highly sensitive information (e.g., electronic health records, governmental documents) whose integrity and confidentiality must be protected over decades or even cent...
How fast can you allocate a large block of memory in C++? (lemire.me)
From chunking to parallelism: faster Pandas with Dask (pythonspeed.com)
Codesearch: fast, indexed regexp search over large file trees (github.com)
The fastest way to read a CSV in Pandas (pythonspeed.com)
Pandas vectorization: faster code, slower code, bloated memory (pythonspeed.com)
Fast Collisions for Large Editable Vehicles (brickadia.com)
What's up Python? New args syntax, subinterpreters, FastAPI and cuda pandas… (bitecode.dev)
reladiff: High-performance diffing of large datasets across databases (github.com)
Show HN: FP32 matmul of large matrices up to 24% faster than cuBLAS on a 4090 (github.com)
I decided to share a CUDA kernel I wrote over 5 months ago. Nvidia's hardware and software may surprise you.
Streaming Large Datasets in Elixir (jackmarchant.com)
How we made querying Pandas DataFrames with chDB 87x faster (clickhouse.com)
Transparency is often lacking in datasets used to train large language models (news.mit.edu)
How we made querying Pandas DataFrames with chDB 87x faster (clickhouse.com)
A fast and space-efficient Base36 encoding for large data (github.com)
Nanocube: Lightning Fast OLAP-style point queries on Pandas DataFrames (github.com)
Show HN: Byte-Pair Encoding tokenizer for training LLMs on large datasets (github.com)
Show HN: How we made querying Pandas DataFrames 87x faster (clickhouse.com)
Ask HN: Are embeddings too expensive for large datasets?
Hi HN,

I've recently spoken with two companies that mentioned the high costs of creating embeddings on their datasets for RAG applications. A PE firm shared that generating embeddings for new data rooms could cost up to $5K, limiting how often they do it.

I'm having trouble understanding wh...