DuckDB
DuckDB is a database that supports reading and querying Parquet files really fast. Begin by creating a connection to DuckDB, and then install and load the httpfs extension to read and write remote files:
			Python
		
			JavaScript
		
import duckdb
url = "https://huggingface.co/datasets/blog_authorship_corpus/resolve/refs%2Fconvert%2Fparquet/blog_authorship_corpus/train/0000.parquet"
con = duckdb.connect()
con.execute("INSTALL httpfs;")
con.execute("LOAD httpfs;")Now you can write and execute your SQL query on the Parquet file:
			Python
		
			JavaScript
		
con.sql(f"SELECT horoscope, count(*), AVG(LENGTH(text)) AS avg_blog_length FROM '{url}' GROUP BY horoscope ORDER BY avg_blog_length DESC LIMIT(5)")
βββββββββββββ¬βββββββββββββββ¬βββββββββββββββββββββ
β horoscope β count_star() β  avg_blog_length   β
β  varchar  β    int64     β       double       β
βββββββββββββΌβββββββββββββββΌβββββββββββββββββββββ€
β Aquarius  β        34062 β  1129.218836239798 β
β Cancer    β        41509 β  1098.366812016671 β
β Capricorn β        33961 β 1073.2002002296751 β
β Libra     β        40302 β 1072.0718326633914 β
β Leo       β        40587 β 1064.0536871412028 β
βββββββββββββ΄βββββββββββββββ΄βββββββββββββββββββββTo query multiple files - for example, if the dataset is sharded:
			Python
		
			JavaScript
		
con.sql(f"SELECT horoscope, count(*), AVG(LENGTH(text)) AS avg_blog_length FROM read_parquet({urls[:2]}) GROUP BY horoscope ORDER BY avg_blog_length DESC LIMIT(5)")
βββββββββββββββ¬βββββββββββββββ¬βββββββββββββββββββββ
β  horoscope  β count_star() β  avg_blog_length   β
β   varchar   β    int64     β       double       β
βββββββββββββββΌβββββββββββββββΌβββββββββββββββββββββ€
β Aquarius    β        49568 β 1125.8306770497095 β
β Cancer      β        63512 β   1097.95608703867 β
β Libra       β        60304 β 1060.6110539931017 β
β Capricorn   β        49402 β 1059.5552609206104 β
β Sagittarius β        50431 β 1057.4589835616982 β
βββββββββββββββ΄βββββββββββββββ΄βββββββββββββββββββββDuckDB-Wasm, a package powered by WebAssembly, is also available for running DuckDB in any browser. This could be useful, for instance, if you want to create a web app to query Parquet files from the browser!