I am trying to analyze the monthly Wikimedia pageview statistics. The daily dumps are fine, but the monthly files, such as the one for June 2021 (https://dumps.wikimedia.org/other/pageview_complete/monthly/2021/2021-06/pageviews-202106-user.bz2), seem to be broken:
[radim@sandbox2 pageviews]$ bzip2 -t pageviews-202106-user.bz2
bzip2: pageviews-202106-user.bz2: bad magic number (file not created by bzip2)
You can use the `bzip2recover' program to attempt to recover
data from undamaged sections of corrupted files.
[radim@sandbox2 pageviews]$ file pageviews-202106-user.bz2
pageviews-202106-user.bz2: Par archive data
Any idea how to extract the data? What format is used here? Could it be a Parquet file from their Hive analytics cluster?
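Despite the .bz2 extension, these files are not bzip2 archives, which is why bzip2 complains about a bad magic number. They are Parquet files: a Parquet file begins and ends with the 4-byte magic PAR1, which is consistent with the "Par archive data" guess from `file`. A tool such as parquet-tools can be used to inspect them.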
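If you want to confirm this yourself before installing anything, here is a minimal Python sketch that checks for the Parquet magic bytes at both ends of the file. The function names `looks_like_parquet` and `preview_parquet` are mine, not part of any library, and the preview helper assumes the third-party pyarrow package is installed. I have not run it against the Wikimedia file, so treat the preview part as a starting point.

```python
import os

PARQUET_MAGIC = b"PAR1"  # a Parquet file starts and ends with these 4 bytes


def looks_like_parquet(path):
    """Return True if the file starts and ends with the Parquet magic bytes."""
    if os.path.getsize(path) < 2 * len(PARQUET_MAGIC):
        return False
    with open(path, "rb") as f:
        head = f.read(len(PARQUET_MAGIC))
        f.seek(-len(PARQUET_MAGIC), os.SEEK_END)
        tail = f.read(len(PARQUET_MAGIC))
    return head == PARQUET_MAGIC and tail == PARQUET_MAGIC


def preview_parquet(path, rows=5):
    """Print schema, row count and the first few rows (assumes pyarrow is installed)."""
    import pyarrow.parquet as pq  # third-party; imported lazily on purpose

    pf = pq.ParquetFile(path)
    print(pf.schema)
    print("rows:", pf.metadata.num_rows)
    # Read only the first batch instead of loading the whole file into memory.
    batch = next(pf.iter_batches(batch_size=rows))
    print(batch.to_pydict())


if __name__ == "__main__":
    name = "pageviews-202106-user.bz2"  # file name taken from the question
    if looks_like_parquet(name):
        preview_parquet(name)
    else:
        print("No Parquet magic bytes found")
```

The magic-byte check only needs the standard library, so it is a cheap first test. Reading in batches matters here because the monthly files are large.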