What determines parquet file buffer sizes
I wrote a DataFrame in spark-shell into hdfs and I got the below output. What I want to understand is, what determines the size of the parquet files being written? My dfs.block.size is set to:
scala> spark.sparkContext.hadoopConfiguration.get("dfs.block.size")
res1: String = 134217728
which is 128MB, so why are my files in the 20,000,000 range of bytes?
-rw-r--r-- 1 hadoop supergroup 0 2018-11-13 11:51 /new_sample_parquet_test/_SUCCESS
-rw-r--r-- 1 hadoop supergroup 23631191 2018-11-13 11:51 /new_sample_parquet_test/part-00000-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 23435545 2018-11-13 11:51 /new_sample_parquet_test/part-00001-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 22568091 2018-11-13 11:51 /new_sample_parquet_test/part-00002-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 23385544 2018-11-13 11:51 /new_sample_parquet_test/part-00003-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 23335676 2018-11-13 11:51 /new_sample_parquet_test/part-00004-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 23423372 2018-11-13 11:51 /new_sample_parquet_test/part-00005-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 22182760 2018-11-13 11:51 /new_sample_parquet_test/part-00006-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 20906453 2018-11-13 11:51 /new_sample_parquet_test/part-00007-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 22918107 2018-11-13 11:51 /new_sample_parquet_test/part-00008-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 21655224 2018-11-13 11:51 /new_sample_parquet_test/part-00009-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 20366872 2018-11-13 11:51 /new_sample_parquet_test/part-00010-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 22658141 2018-11-13 11:51 /new_sample_parquet_test/part-00011-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 22246580 2018-11-13 11:51 /new_sample_parquet_test/part-00012-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 20648612 2018-11-13 11:51 /new_sample_parquet_test/part-00013-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 22369663 2018-11-13 11:51 /new_sample_parquet_test/part-00014-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 23396027 2018-11-13 11:51 /new_sample_parquet_test/part-00015-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 23382811 2018-11-13 11:51 /new_sample_parquet_test/part-00016-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 17470540 2018-11-13 11:51 /new_sample_parquet_test/part-00017-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 22669018 2018-11-13 11:51 /new_sample_parquet_test/part-00018-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 21899425 2018-11-13 11:51 /new_sample_parquet_test/part-00019-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 21378060 2018-11-13 11:51 /new_sample_parquet_test/part-00020-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 21578176 2018-11-13 11:51 /new_sample_parquet_test/part-00021-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 21079291 2018-11-13 11:51 /new_sample_parquet_test/part-00022-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 21526313 2018-11-13 11:51 /new_sample_parquet_test/part-00023-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 22446489 2018-11-13 11:51 /new_sample_parquet_test/part-00024-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 21770955 2018-11-13 11:51 /new_sample_parquet_test/part-00025-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 23199003 2018-11-13 11:51 /new_sample_parquet_test/part-00026-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 21833916 2018-11-13 11:51 /new_sample_parquet_test/part-00027-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 25090443 2018-11-13 11:51 /new_sample_parquet_test/part-00028-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 20725755 2018-11-13 11:51 /new_sample_parquet_test/part-00029-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 20666565 2018-11-13 11:51 /new_sample_parquet_test/part-00030-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 22299474 2018-11-13 11:51 /new_sample_parquet_test/part-00031-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 22327133 2018-11-13 11:51 /new_sample_parquet_test/part-00032-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 22207468 2018-11-13 11:51 /new_sample_parquet_test/part-00033-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 22630251 2018-11-13 11:51 /new_sample_parquet_test/part-00034-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 21648270 2018-11-13 11:51 /new_sample_parquet_test/part-00035-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 22230127 2018-11-13 11:51 /new_sample_parquet_test/part-00036-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 22439910 2018-11-13 11:51 /new_sample_parquet_test/part-00037-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 22252551 2018-11-13 11:51 /new_sample_parquet_test/part-00038-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 22160655 2018-11-13 11:51 /new_sample_parquet_test/part-00039-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 17637580 2018-11-13 11:51 /new_sample_parquet_test/part-00040-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 21743969 2018-11-13 11:51 /new_sample_parquet_test/part-00041-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 22071235 2018-11-13 11:51 /new_sample_parquet_test/part-00042-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 21854771 2018-11-13 11:51 /new_sample_parquet_test/part-00043-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 25243330 2018-11-13 11:51 /new_sample_parquet_test/part-00044-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 22297865 2018-11-13 11:51 /new_sample_parquet_test/part-00045-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 22070057 2018-11-13 11:51 /new_sample_parquet_test/part-00046-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 22018671 2018-11-13 11:51 /new_sample_parquet_test/part-00047-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 21796749 2018-11-13 11:51 /new_sample_parquet_test/part-00048-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 22597634 2018-11-13 11:51 /new_sample_parquet_test/part-00049-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 20728588 2018-11-13 11:51 /new_sample_parquet_test/part-00050-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 22137701 2018-11-13 11:51 /new_sample_parquet_test/part-00051-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 22387635 2018-11-13 11:51 /new_sample_parquet_test/part-00052-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 20965957 2018-11-13 11:51 /new_sample_parquet_test/part-00053-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 20314451 2018-11-13 11:51 /new_sample_parquet_test/part-00054-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 22538965 2018-11-13 11:51 /new_sample_parquet_test/part-00055-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 20923261 2018-11-13 11:51 /new_sample_parquet_test/part-00056-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 20984805 2018-11-13 11:51 /new_sample_parquet_test/part-00057-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 20513317 2018-11-13 11:51 /new_sample_parquet_test/part-00058-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 25493903 2018-11-13 11:51 /new_sample_parquet_test/part-00059-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 21178862 2018-11-13 11:51 /new_sample_parquet_test/part-00060-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 20696540 2018-11-13 11:51 /new_sample_parquet_test/part-00061-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 21011416 2018-11-13 11:51 /new_sample_parquet_test/part-00062-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 15752503 2018-11-13 11:51 /new_sample_parquet_test/part-00063-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
apache-spark hadoop hdfs parquet
add a comment |
I wrote a DataFrame in spark-shell into hdfs and I got the below output. What I want to understand is, what determines the size of the parquet files being written? My dfs.block.size is set to:
scala> spark.sparkContext.hadoopConfiguration.get("dfs.block.size")
res1: String = 134217728
which is 128MB, so why are my files in the 20,000,000 range of bytes?
-rw-r--r-- 1 hadoop supergroup 0 2018-11-13 11:51 /new_sample_parquet_test/_SUCCESS
-rw-r--r-- 1 hadoop supergroup 23631191 2018-11-13 11:51 /new_sample_parquet_test/part-00000-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 23435545 2018-11-13 11:51 /new_sample_parquet_test/part-00001-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 22568091 2018-11-13 11:51 /new_sample_parquet_test/part-00002-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 23385544 2018-11-13 11:51 /new_sample_parquet_test/part-00003-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 23335676 2018-11-13 11:51 /new_sample_parquet_test/part-00004-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 23423372 2018-11-13 11:51 /new_sample_parquet_test/part-00005-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 22182760 2018-11-13 11:51 /new_sample_parquet_test/part-00006-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 20906453 2018-11-13 11:51 /new_sample_parquet_test/part-00007-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 22918107 2018-11-13 11:51 /new_sample_parquet_test/part-00008-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 21655224 2018-11-13 11:51 /new_sample_parquet_test/part-00009-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 20366872 2018-11-13 11:51 /new_sample_parquet_test/part-00010-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 22658141 2018-11-13 11:51 /new_sample_parquet_test/part-00011-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 22246580 2018-11-13 11:51 /new_sample_parquet_test/part-00012-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 20648612 2018-11-13 11:51 /new_sample_parquet_test/part-00013-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 22369663 2018-11-13 11:51 /new_sample_parquet_test/part-00014-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 23396027 2018-11-13 11:51 /new_sample_parquet_test/part-00015-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 23382811 2018-11-13 11:51 /new_sample_parquet_test/part-00016-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 17470540 2018-11-13 11:51 /new_sample_parquet_test/part-00017-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 22669018 2018-11-13 11:51 /new_sample_parquet_test/part-00018-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 21899425 2018-11-13 11:51 /new_sample_parquet_test/part-00019-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 21378060 2018-11-13 11:51 /new_sample_parquet_test/part-00020-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 21578176 2018-11-13 11:51 /new_sample_parquet_test/part-00021-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 21079291 2018-11-13 11:51 /new_sample_parquet_test/part-00022-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 21526313 2018-11-13 11:51 /new_sample_parquet_test/part-00023-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 22446489 2018-11-13 11:51 /new_sample_parquet_test/part-00024-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 21770955 2018-11-13 11:51 /new_sample_parquet_test/part-00025-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 23199003 2018-11-13 11:51 /new_sample_parquet_test/part-00026-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 21833916 2018-11-13 11:51 /new_sample_parquet_test/part-00027-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 25090443 2018-11-13 11:51 /new_sample_parquet_test/part-00028-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 20725755 2018-11-13 11:51 /new_sample_parquet_test/part-00029-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 20666565 2018-11-13 11:51 /new_sample_parquet_test/part-00030-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 22299474 2018-11-13 11:51 /new_sample_parquet_test/part-00031-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 22327133 2018-11-13 11:51 /new_sample_parquet_test/part-00032-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 22207468 2018-11-13 11:51 /new_sample_parquet_test/part-00033-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 22630251 2018-11-13 11:51 /new_sample_parquet_test/part-00034-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 21648270 2018-11-13 11:51 /new_sample_parquet_test/part-00035-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 22230127 2018-11-13 11:51 /new_sample_parquet_test/part-00036-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 22439910 2018-11-13 11:51 /new_sample_parquet_test/part-00037-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 22252551 2018-11-13 11:51 /new_sample_parquet_test/part-00038-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 22160655 2018-11-13 11:51 /new_sample_parquet_test/part-00039-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 17637580 2018-11-13 11:51 /new_sample_parquet_test/part-00040-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 21743969 2018-11-13 11:51 /new_sample_parquet_test/part-00041-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 22071235 2018-11-13 11:51 /new_sample_parquet_test/part-00042-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 21854771 2018-11-13 11:51 /new_sample_parquet_test/part-00043-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 25243330 2018-11-13 11:51 /new_sample_parquet_test/part-00044-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 22297865 2018-11-13 11:51 /new_sample_parquet_test/part-00045-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 22070057 2018-11-13 11:51 /new_sample_parquet_test/part-00046-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 22018671 2018-11-13 11:51 /new_sample_parquet_test/part-00047-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 21796749 2018-11-13 11:51 /new_sample_parquet_test/part-00048-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 22597634 2018-11-13 11:51 /new_sample_parquet_test/part-00049-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 20728588 2018-11-13 11:51 /new_sample_parquet_test/part-00050-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 22137701 2018-11-13 11:51 /new_sample_parquet_test/part-00051-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 22387635 2018-11-13 11:51 /new_sample_parquet_test/part-00052-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 20965957 2018-11-13 11:51 /new_sample_parquet_test/part-00053-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 20314451 2018-11-13 11:51 /new_sample_parquet_test/part-00054-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 22538965 2018-11-13 11:51 /new_sample_parquet_test/part-00055-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 20923261 2018-11-13 11:51 /new_sample_parquet_test/part-00056-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 20984805 2018-11-13 11:51 /new_sample_parquet_test/part-00057-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 20513317 2018-11-13 11:51 /new_sample_parquet_test/part-00058-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 25493903 2018-11-13 11:51 /new_sample_parquet_test/part-00059-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 21178862 2018-11-13 11:51 /new_sample_parquet_test/part-00060-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 20696540 2018-11-13 11:51 /new_sample_parquet_test/part-00061-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 21011416 2018-11-13 11:51 /new_sample_parquet_test/part-00062-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 15752503 2018-11-13 11:51 /new_sample_parquet_test/part-00063-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
apache-spark hadoop hdfs parquet
add a comment |
I wrote a DataFrame in spark-shell into hdfs and I got the below output. What I want to understand is, what determines the size of the parquet files being written? My dfs.block.size is set to:
scala> spark.sparkContext.hadoopConfiguration.get("dfs.block.size")
res1: String = 134217728
which is 128MB, so why are my files in the 20,000,000 range of bytes?
-rw-r--r-- 1 hadoop supergroup 0 2018-11-13 11:51 /new_sample_parquet_test/_SUCCESS
-rw-r--r-- 1 hadoop supergroup 23631191 2018-11-13 11:51 /new_sample_parquet_test/part-00000-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 23435545 2018-11-13 11:51 /new_sample_parquet_test/part-00001-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 22568091 2018-11-13 11:51 /new_sample_parquet_test/part-00002-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 23385544 2018-11-13 11:51 /new_sample_parquet_test/part-00003-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 23335676 2018-11-13 11:51 /new_sample_parquet_test/part-00004-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 23423372 2018-11-13 11:51 /new_sample_parquet_test/part-00005-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 22182760 2018-11-13 11:51 /new_sample_parquet_test/part-00006-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 20906453 2018-11-13 11:51 /new_sample_parquet_test/part-00007-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 22918107 2018-11-13 11:51 /new_sample_parquet_test/part-00008-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 21655224 2018-11-13 11:51 /new_sample_parquet_test/part-00009-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 20366872 2018-11-13 11:51 /new_sample_parquet_test/part-00010-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 22658141 2018-11-13 11:51 /new_sample_parquet_test/part-00011-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 22246580 2018-11-13 11:51 /new_sample_parquet_test/part-00012-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 20648612 2018-11-13 11:51 /new_sample_parquet_test/part-00013-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 22369663 2018-11-13 11:51 /new_sample_parquet_test/part-00014-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 23396027 2018-11-13 11:51 /new_sample_parquet_test/part-00015-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 23382811 2018-11-13 11:51 /new_sample_parquet_test/part-00016-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 17470540 2018-11-13 11:51 /new_sample_parquet_test/part-00017-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 22669018 2018-11-13 11:51 /new_sample_parquet_test/part-00018-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 21899425 2018-11-13 11:51 /new_sample_parquet_test/part-00019-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 21378060 2018-11-13 11:51 /new_sample_parquet_test/part-00020-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 21578176 2018-11-13 11:51 /new_sample_parquet_test/part-00021-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 21079291 2018-11-13 11:51 /new_sample_parquet_test/part-00022-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 21526313 2018-11-13 11:51 /new_sample_parquet_test/part-00023-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 22446489 2018-11-13 11:51 /new_sample_parquet_test/part-00024-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 21770955 2018-11-13 11:51 /new_sample_parquet_test/part-00025-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 23199003 2018-11-13 11:51 /new_sample_parquet_test/part-00026-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 21833916 2018-11-13 11:51 /new_sample_parquet_test/part-00027-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 25090443 2018-11-13 11:51 /new_sample_parquet_test/part-00028-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 20725755 2018-11-13 11:51 /new_sample_parquet_test/part-00029-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 20666565 2018-11-13 11:51 /new_sample_parquet_test/part-00030-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 22299474 2018-11-13 11:51 /new_sample_parquet_test/part-00031-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 22327133 2018-11-13 11:51 /new_sample_parquet_test/part-00032-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 22207468 2018-11-13 11:51 /new_sample_parquet_test/part-00033-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 22630251 2018-11-13 11:51 /new_sample_parquet_test/part-00034-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 21648270 2018-11-13 11:51 /new_sample_parquet_test/part-00035-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 22230127 2018-11-13 11:51 /new_sample_parquet_test/part-00036-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 22439910 2018-11-13 11:51 /new_sample_parquet_test/part-00037-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 22252551 2018-11-13 11:51 /new_sample_parquet_test/part-00038-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 22160655 2018-11-13 11:51 /new_sample_parquet_test/part-00039-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 17637580 2018-11-13 11:51 /new_sample_parquet_test/part-00040-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 21743969 2018-11-13 11:51 /new_sample_parquet_test/part-00041-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 22071235 2018-11-13 11:51 /new_sample_parquet_test/part-00042-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 21854771 2018-11-13 11:51 /new_sample_parquet_test/part-00043-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 25243330 2018-11-13 11:51 /new_sample_parquet_test/part-00044-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 22297865 2018-11-13 11:51 /new_sample_parquet_test/part-00045-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 22070057 2018-11-13 11:51 /new_sample_parquet_test/part-00046-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 22018671 2018-11-13 11:51 /new_sample_parquet_test/part-00047-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 21796749 2018-11-13 11:51 /new_sample_parquet_test/part-00048-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 22597634 2018-11-13 11:51 /new_sample_parquet_test/part-00049-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 20728588 2018-11-13 11:51 /new_sample_parquet_test/part-00050-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 22137701 2018-11-13 11:51 /new_sample_parquet_test/part-00051-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 22387635 2018-11-13 11:51 /new_sample_parquet_test/part-00052-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 20965957 2018-11-13 11:51 /new_sample_parquet_test/part-00053-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 20314451 2018-11-13 11:51 /new_sample_parquet_test/part-00054-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 22538965 2018-11-13 11:51 /new_sample_parquet_test/part-00055-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 20923261 2018-11-13 11:51 /new_sample_parquet_test/part-00056-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 20984805 2018-11-13 11:51 /new_sample_parquet_test/part-00057-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 20513317 2018-11-13 11:51 /new_sample_parquet_test/part-00058-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 25493903 2018-11-13 11:51 /new_sample_parquet_test/part-00059-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 21178862 2018-11-13 11:51 /new_sample_parquet_test/part-00060-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 20696540 2018-11-13 11:51 /new_sample_parquet_test/part-00061-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 21011416 2018-11-13 11:51 /new_sample_parquet_test/part-00062-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 15752503 2018-11-13 11:51 /new_sample_parquet_test/part-00063-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
apache-spark hadoop hdfs parquet
I wrote a DataFrame in spark-shell into hdfs and I got the below output. What I want to understand is, what determines the size of the parquet files being written? My dfs.block.size is set to:
scala> spark.sparkContext.hadoopConfiguration.get("dfs.block.size")
res1: String = 134217728
which is 128MB, so why are my files in the 20,000,000 range of bytes?
-rw-r--r-- 1 hadoop supergroup 0 2018-11-13 11:51 /new_sample_parquet_test/_SUCCESS
-rw-r--r-- 1 hadoop supergroup 23631191 2018-11-13 11:51 /new_sample_parquet_test/part-00000-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 23435545 2018-11-13 11:51 /new_sample_parquet_test/part-00001-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 22568091 2018-11-13 11:51 /new_sample_parquet_test/part-00002-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 23385544 2018-11-13 11:51 /new_sample_parquet_test/part-00003-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 23335676 2018-11-13 11:51 /new_sample_parquet_test/part-00004-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 23423372 2018-11-13 11:51 /new_sample_parquet_test/part-00005-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 22182760 2018-11-13 11:51 /new_sample_parquet_test/part-00006-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 20906453 2018-11-13 11:51 /new_sample_parquet_test/part-00007-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 22918107 2018-11-13 11:51 /new_sample_parquet_test/part-00008-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 21655224 2018-11-13 11:51 /new_sample_parquet_test/part-00009-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 20366872 2018-11-13 11:51 /new_sample_parquet_test/part-00010-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 22658141 2018-11-13 11:51 /new_sample_parquet_test/part-00011-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 22246580 2018-11-13 11:51 /new_sample_parquet_test/part-00012-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 20648612 2018-11-13 11:51 /new_sample_parquet_test/part-00013-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 22369663 2018-11-13 11:51 /new_sample_parquet_test/part-00014-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 23396027 2018-11-13 11:51 /new_sample_parquet_test/part-00015-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 23382811 2018-11-13 11:51 /new_sample_parquet_test/part-00016-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 17470540 2018-11-13 11:51 /new_sample_parquet_test/part-00017-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 22669018 2018-11-13 11:51 /new_sample_parquet_test/part-00018-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 21899425 2018-11-13 11:51 /new_sample_parquet_test/part-00019-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 21378060 2018-11-13 11:51 /new_sample_parquet_test/part-00020-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 21578176 2018-11-13 11:51 /new_sample_parquet_test/part-00021-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 21079291 2018-11-13 11:51 /new_sample_parquet_test/part-00022-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 21526313 2018-11-13 11:51 /new_sample_parquet_test/part-00023-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 22446489 2018-11-13 11:51 /new_sample_parquet_test/part-00024-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 21770955 2018-11-13 11:51 /new_sample_parquet_test/part-00025-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 23199003 2018-11-13 11:51 /new_sample_parquet_test/part-00026-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 21833916 2018-11-13 11:51 /new_sample_parquet_test/part-00027-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 25090443 2018-11-13 11:51 /new_sample_parquet_test/part-00028-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 20725755 2018-11-13 11:51 /new_sample_parquet_test/part-00029-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 20666565 2018-11-13 11:51 /new_sample_parquet_test/part-00030-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 22299474 2018-11-13 11:51 /new_sample_parquet_test/part-00031-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 22327133 2018-11-13 11:51 /new_sample_parquet_test/part-00032-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 22207468 2018-11-13 11:51 /new_sample_parquet_test/part-00033-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 22630251 2018-11-13 11:51 /new_sample_parquet_test/part-00034-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 21648270 2018-11-13 11:51 /new_sample_parquet_test/part-00035-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 22230127 2018-11-13 11:51 /new_sample_parquet_test/part-00036-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 22439910 2018-11-13 11:51 /new_sample_parquet_test/part-00037-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 22252551 2018-11-13 11:51 /new_sample_parquet_test/part-00038-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 22160655 2018-11-13 11:51 /new_sample_parquet_test/part-00039-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 17637580 2018-11-13 11:51 /new_sample_parquet_test/part-00040-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 21743969 2018-11-13 11:51 /new_sample_parquet_test/part-00041-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 22071235 2018-11-13 11:51 /new_sample_parquet_test/part-00042-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 21854771 2018-11-13 11:51 /new_sample_parquet_test/part-00043-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 25243330 2018-11-13 11:51 /new_sample_parquet_test/part-00044-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 22297865 2018-11-13 11:51 /new_sample_parquet_test/part-00045-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 22070057 2018-11-13 11:51 /new_sample_parquet_test/part-00046-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 22018671 2018-11-13 11:51 /new_sample_parquet_test/part-00047-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 21796749 2018-11-13 11:51 /new_sample_parquet_test/part-00048-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 22597634 2018-11-13 11:51 /new_sample_parquet_test/part-00049-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 20728588 2018-11-13 11:51 /new_sample_parquet_test/part-00050-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 22137701 2018-11-13 11:51 /new_sample_parquet_test/part-00051-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 22387635 2018-11-13 11:51 /new_sample_parquet_test/part-00052-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 20965957 2018-11-13 11:51 /new_sample_parquet_test/part-00053-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 20314451 2018-11-13 11:51 /new_sample_parquet_test/part-00054-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 22538965 2018-11-13 11:51 /new_sample_parquet_test/part-00055-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 20923261 2018-11-13 11:51 /new_sample_parquet_test/part-00056-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 20984805 2018-11-13 11:51 /new_sample_parquet_test/part-00057-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 20513317 2018-11-13 11:51 /new_sample_parquet_test/part-00058-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 25493903 2018-11-13 11:51 /new_sample_parquet_test/part-00059-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 21178862 2018-11-13 11:51 /new_sample_parquet_test/part-00060-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 20696540 2018-11-13 11:51 /new_sample_parquet_test/part-00061-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 21011416 2018-11-13 11:51 /new_sample_parquet_test/part-00062-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
-rw-r--r-- 1 hadoop supergroup 15752503 2018-11-13 11:51 /new_sample_parquet_test/part-00063-18b6439e-ce51-49e3-afac-e93d5cf6de44-c000.snappy.parquet
apache-spark hadoop hdfs parquet
apache-spark hadoop hdfs parquet
edited Nov 13 '18 at 17:35
user3685285
asked Nov 13 '18 at 17:27
user3685285user3685285
1,56431737
1,56431737
add a comment |
add a comment |
1 Answer
1
active
oldest
votes
Parquet writer is not concerned with HDFS block size as you can save parquet e.g. on a local hard drive. What determines the number and sizes of individual part-*.parquet files is the number of partitions in your data frame (64 in your case). If you would do df.coalesce(1).write.parquet(...), you'll have just one large part file.
If you want the part files to be around 128 Mb each, coalesce parameter should be around 20 * 64 / 128 = 10. The part file size for a given number of coalesced partitions dependency is not strictly linear though. The smaller the number of part files, the more efficient encoding/compression is.
See coalesce method description for details
I'm a little confused. Is it not true that there is one hdfs block per file? Also, there is only one partition in the data, so I still don't get why it became so many small files. This isn't implying that there is one file per partition is it?
– user3685285
Nov 13 '18 at 19:04
> Is it not true that there is one hdfs block per file? Of course there may be multiple blocks per file, otherwise how would you store, say 1 Gb file? > Also, there is only one partition in the data Are you sure you only have a single partition? What df.rdd.getNumPartitions shows for your DF?
– Denis Makarenko
Nov 13 '18 at 21:30
add a comment |
Your Answer
StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53286512%2fwhat-determines-parquet-file-buffer-sizes%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
1 Answer
1
active
oldest
votes
1 Answer
1
active
oldest
votes
active
oldest
votes
active
oldest
votes
Parquet writer is not concerned with HDFS block size as you can save parquet e.g. on a local hard drive. What determines the number and sizes of individual part-*.parquet files is the number of partitions in your data frame (64 in your case). If you would do df.coalesce(1).write.parquet(...), you'll have just one large part file.
If you want the part files to be around 128 Mb each, coalesce parameter should be around 20 * 64 / 128 = 10. The part file size for a given number of coalesced partitions dependency is not strictly linear though. The smaller the number of part files, the more efficient encoding/compression is.
See coalesce method description for details
I'm a little confused. Is it not true that there is one hdfs block per file? Also, there is only one partition in the data, so I still don't get why it became so many small files. This isn't implying that there is one file per partition is it?
– user3685285
Nov 13 '18 at 19:04
> Is it not true that there is one hdfs block per file? Of course there may be multiple blocks per file, otherwise how would you store, say 1 Gb file? > Also, there is only one partition in the data Are you sure you only have a single partition? What df.rdd.getNumPartitions shows for your DF?
– Denis Makarenko
Nov 13 '18 at 21:30
add a comment |
Parquet writer is not concerned with HDFS block size as you can save parquet e.g. on a local hard drive. What determines the number and sizes of individual part-*.parquet files is the number of partitions in your data frame (64 in your case). If you would do df.coalesce(1).write.parquet(...), you'll have just one large part file.
If you want the part files to be around 128 Mb each, coalesce parameter should be around 20 * 64 / 128 = 10. The part file size for a given number of coalesced partitions dependency is not strictly linear though. The smaller the number of part files, the more efficient encoding/compression is.
See coalesce method description for details
I'm a little confused. Is it not true that there is one hdfs block per file? Also, there is only one partition in the data, so I still don't get why it became so many small files. This isn't implying that there is one file per partition is it?
– user3685285
Nov 13 '18 at 19:04
> Is it not true that there is one hdfs block per file? Of course there may be multiple blocks per file, otherwise how would you store, say 1 Gb file? > Also, there is only one partition in the data Are you sure you only have a single partition? What df.rdd.getNumPartitions shows for your DF?
– Denis Makarenko
Nov 13 '18 at 21:30
add a comment |
Parquet writer is not concerned with HDFS block size as you can save parquet e.g. on a local hard drive. What determines the number and sizes of individual part-*.parquet files is the number of partitions in your data frame (64 in your case). If you would do df.coalesce(1).write.parquet(...), you'll have just one large part file.
If you want the part files to be around 128 Mb each, coalesce parameter should be around 20 * 64 / 128 = 10. The part file size for a given number of coalesced partitions dependency is not strictly linear though. The smaller the number of part files, the more efficient encoding/compression is.
See coalesce method description for details
Parquet writer is not concerned with HDFS block size as you can save parquet e.g. on a local hard drive. What determines the number and sizes of individual part-*.parquet files is the number of partitions in your data frame (64 in your case). If you would do df.coalesce(1).write.parquet(...), you'll have just one large part file.
If you want the part files to be around 128 Mb each, coalesce parameter should be around 20 * 64 / 128 = 10. The part file size for a given number of coalesced partitions dependency is not strictly linear though. The smaller the number of part files, the more efficient encoding/compression is.
See coalesce method description for details
answered Nov 13 '18 at 19:01
Denis MakarenkoDenis Makarenko
1,971821
1,971821
I'm a little confused. Is it not true that there is one hdfs block per file? Also, there is only one partition in the data, so I still don't get why it became so many small files. This isn't implying that there is one file per partition is it?
– user3685285
Nov 13 '18 at 19:04
> Is it not true that there is one hdfs block per file? Of course there may be multiple blocks per file, otherwise how would you store, say 1 Gb file? > Also, there is only one partition in the data Are you sure you only have a single partition? What df.rdd.getNumPartitions shows for your DF?
– Denis Makarenko
Nov 13 '18 at 21:30
add a comment |
I'm a little confused. Is it not true that there is one hdfs block per file? Also, there is only one partition in the data, so I still don't get why it became so many small files. This isn't implying that there is one file per partition is it?
– user3685285
Nov 13 '18 at 19:04
> Is it not true that there is one hdfs block per file? Of course there may be multiple blocks per file, otherwise how would you store, say 1 Gb file? > Also, there is only one partition in the data Are you sure you only have a single partition? What df.rdd.getNumPartitions shows for your DF?
– Denis Makarenko
Nov 13 '18 at 21:30
I'm a little confused. Is it not true that there is one hdfs block per file? Also, there is only one partition in the data, so I still don't get why it became so many small files. This isn't implying that there is one file per partition is it?
– user3685285
Nov 13 '18 at 19:04
I'm a little confused. Is it not true that there is one hdfs block per file? Also, there is only one partition in the data, so I still don't get why it became so many small files. This isn't implying that there is one file per partition is it?
– user3685285
Nov 13 '18 at 19:04
> Is it not true that there is one hdfs block per file? Of course there may be multiple blocks per file, otherwise how would you store, say 1 Gb file? > Also, there is only one partition in the data Are you sure you only have a single partition? What df.rdd.getNumPartitions shows for your DF?
– Denis Makarenko
Nov 13 '18 at 21:30
> Is it not true that there is one hdfs block per file? Of course there may be multiple blocks per file, otherwise how would you store, say 1 Gb file? > Also, there is only one partition in the data Are you sure you only have a single partition? What df.rdd.getNumPartitions shows for your DF?
– Denis Makarenko
Nov 13 '18 at 21:30
add a comment |
Thanks for contributing an answer to Stack Overflow!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53286512%2fwhat-determines-parquet-file-buffer-sizes%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown