Hi,
I was working as a volunteer with the Metrics team under the auspices of Karsten Loesing. He ended that collaboration in 2016 in a rather unfortunate way, and I have no idea if he ever looked at my code again. I do still have the code that I worked on, and I still think it has some value: the main contribution is a converter from Metrics’ own data format to Parquet. Parquet is a columnar data format optimized for Big Data analytics in tools like Spark and related technologies such as Jupyter notebooks. I also spent a lot of time understanding, organizing, and writing down what all the data points that Metrics collects mean, how they correlate, etc. In the end all of this was meant to drive a fantastic visualization of Tor data, but that part never materialized. The decision for Parquet and Spark was not trivial back then, but it seems to have been a good one, since both still appear to be strong offerings in the Big Data space.
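To give a concrete idea of what the Parquet conversion buys you, here is a minimal PySpark sketch of the kind of query it was meant to enable. The file path and column names are hypothetical and do not reflect the converter’s actual schema:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    # Start a local Spark session (Spark ships with a Parquet reader).
    spark = SparkSession.builder.appName("metrics-parquet-demo").getOrCreate()

    # Hypothetical converter output: one Parquet dataset of relay status
    # entries. Path and column names are made up for this sketch.
    consensus = spark.read.parquet("metrics/consensus.parquet")

    # Example aggregation: mean advertised bandwidth per day -- the kind
    # of query that used to run in Postgres and that Spark parallelizes
    # across cores or a cluster out of the box.
    (consensus
        .groupBy(F.to_date("valid_after").alias("day"))
        .agg(F.avg("advertised_bandwidth").alias("mean_bw"))
        .orderBy("day")
        .show())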
I haven’t worked on the code since, it never had a proper test suite, and Metrics’ data formats have probably evolved since then. Also, I haven’t followed the work of the Metrics team, and maybe you have developed something much better in the meantime. Back then, six years ago, most aggregation was done in Postgres, and Spark would most probably have easily outperformed the Postgres RDBMS. By the way, I just remembered that I also developed a giant relational schema for all of Metrics’ data...
Anyway, if any of this sounds interesting to you, please contact me (off-list, as I don’t follow the lists anymore). I will be happy to give you access to the GitLab repo, spend a few hours getting myself back up to speed on what I was doing back then, and help you evaluate whether it is of any value to you. However, I will not work on the code again in earnest, as after Karsten kicked me out I completely switched back to what I was doing before I joined Tor.
Best,
Thomas