Hi everyone,
I posted some thoughts on Scaling Tor Metrics [0] almost two weeks ago and received very useful feedback from George, David, Thomas, and Letty. Thanks for that! In fact, this is such a big topic and there was so much feedback that I decided not to respond to every idea individually but instead to start over and suggest a new plan that incorporates all the feedback I saw. Otherwise, I'd be worried that we'd lose ourselves in the details and miss the big picture. Maybe this also makes the topic somewhat easier to follow and respond to, which I hope many people will do.
The problem to solve is still that the Tor Metrics website has a huge product backlog and that we want it to remain "the primary place to learn interesting facts about the Tor network", either by making it better and bigger internally, or by adding more ways to let others contribute to it externally.
From the feedback in round 1, I observe three major areas where we need to improve Tor Metrics:
1 Metrics frontend
2 Metrics backend
3 External contributions
There are low-hanging fruit in each area, but there are many fruit overall and some are hanging higher than we might think. I'll go through them area by area and assign numbers to the tasks.
1 Metrics frontend
The frontend is the part of the Tor Metrics website that takes pre-aggregated data in .csv format as input and produces somewhat interactive visualizations. The current frontend uses Java servlets/JSP as the web framework and R/ggplot2 for graphs. This code is hard to extend, even for me, and the result isn't pretty, but on the plus side the website doesn't require any JavaScript.
So, one task would be: #1 decide whether we can still ignore JavaScript and what it has to offer. I agree that D3.js is cool; I even used it myself in the past, though I know very little about it. This decision would mean that we develop new visualizations in D3.js and phase out the existing R/ggplot2 visualizations one by one. This is a tough decision, but one with a lot of potential. I understand why we're excited about this as developers, but I'd want to ask Metrics users about this first.
Another task would be: #2 website redesign. In fact, what you see on the website right now is a redesign halfway through. Believe me, the website was even less readable a year or two ago, and that statement alone tells you how slowly things are moving. The remaining step is just to replace the start page with a gallery, well, and to apply a Bootstrap design to everything, because why not. One challenge here is that the current graphs all look quite similar, which makes them hard to distinguish in a gallery, but that's probably still more useful than putting text there. It sounds like Letty and Thomas might be willing to help out with this, which would be great.
Yet another task would be: #3 replace the website framework with something more recent. This can be something simple, as long as it supports some basic filtering and maybe searching on the start page. I'd say let's pick something Python-based here. However, maybe we should first replace the existing graphs that are deeply tied into the current website framework. Once we've switched to D3.js and replaced all existing graphs, switching to a new website framework will hurt a lot less.
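To make #3 a bit more concrete, here's a minimal sketch of what a Python-based start page with basic filtering could look like. This is just an illustration, not a framework decision: I'm using Flask as one example of a simple Python framework, and the graph list and fields are made up.

    # Hypothetical sketch of a start page with basic search/filtering,
    # using Flask as one possible Python-based framework.
    from flask import Flask, request, render_template_string

    app = Flask(__name__)

    # Made-up stand-in for the gallery metadata we'd actually generate.
    GRAPHS = [
        {"name": "relays-by-flag", "keywords": "relays flags consensus"},
        {"name": "advertised-bandwidth", "keywords": "bandwidth relays"},
        {"name": "userstats-relay-country", "keywords": "users countries"},
    ]

    @app.route("/")
    def start_page():
        query = request.args.get("q", "").lower()
        graphs = [g for g in GRAPHS
                  if query in g["name"] or query in g["keywords"]]
        return render_template_string(
            "<ul>{% for g in graphs %}<li>{{ g.name }}</li>{% endfor %}</ul>",
            graphs=graphs)

    if __name__ == "__main__":
        app.run()

Whatever framework we end up with, the point is that the start page stays a thin layer over static graph metadata, so switching frameworks again later stays cheap.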
Another high-hanging fruit would be: #4 build something like Thomas' Visionion, where users can create generic visualizations on the fly. This is an exciting idea, really, but I think we have to accept that it's out of reach for now.
2 Metrics backend
The backend of Tor Metrics consists of a bunch of programs that run once per day to fetch new data from CollecTor and produce a set of .csv files for the frontend. There are no strict requirements on languages and databases, as long as the tools are available in Debian stable. Some programs use PostgreSQL, but most of them just use files. Ironically, it's the database-based tools that have major performance problems, whereas the file-based ones work just fine. Most programs are written in Java, a few in Python.
One rather low-hanging fruit would be: #5 document backend programs and say what's required to add one more to the bunch. The biggest challenge in writing such a program is that it needs to stay reasonably fast even over the next couple of years and even if the network doubles or triples in size. I started documenting this a while ago, but got distracted by other work. I wouldn't mind help.
Another rather low-hanging fruit would be: #6 use Big Data to produce pre-aggregated data for the frontend. As said before, it doesn't matter whether a backend program uses files or PostgreSQL or another database engine. What matters is that it reads data from CollecTor and produces data that the frontend can use. This could be a CSV or JSON file. We should probably pick the next visualization project as a test case for applying Big Data tools, or we could rewrite one that needs to be rewritten for performance reasons anyway.
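For illustration, here's roughly the shape of such a backend program, reduced to the essentials. All file names and columns below are made up; the point is only the contract: read raw data that an earlier step pulled from CollecTor, aggregate it, and write one tidy .csv for the frontend.

    # Minimal sketch of a daily backend aggregation step. Input and
    # output file names and columns are hypothetical.
    import csv
    from collections import defaultdict

    # Pretend an earlier step extracted one row per relay per day from
    # CollecTor data into raw/relay-statuses.csv with a "date" column.
    relays_per_day = defaultdict(int)
    with open("raw/relay-statuses.csv") as infile:
        for row in csv.DictReader(infile):
            relays_per_day[row["date"]] += 1

    # Write the pre-aggregated file that the frontend would plot as-is.
    with open("stats/relays-per-day.csv", "w", newline="") as outfile:
        writer = csv.writer(outfile)
        writer.writerow(["date", "relays"])
        for date in sorted(relays_per_day):
            writer.writerow([date, relays_per_day[date]])

Whether the aggregation in the middle happens in plain Python, PostgreSQL, or some Big Data tool is exactly the part that doesn't matter, as long as the output file stays small and stable.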
Here's a not-so-low-hanging fruit for the backend: #7 have the backend provide an API to the frontend. This is potentially more difficult. The part that I'm very much afraid of is performance. It's just too easy to build a tool that performs reasonably well during testing but that can't handle the load of 10 or 100 people looking at a frontend visualization at once. In particular, it's easy to build an API that works just fine and then add another feature that looks harmless, which later turns out to hurt performance a lot. I'd say postpone.
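To illustrate what I mean by a harmless-looking feature: the difference between an endpoint that recomputes an aggregate on every request and one that serves precomputed results is invisible in testing and very visible in production. A hypothetical sketch, with all names made up:

    # Hypothetical sketch of the performance trap in a Metrics API.
    from functools import lru_cache

    def aggregate(start_date, end_date):
        # Stand-in for an expensive walk over raw descriptor data; this
        # is the part that must not run once per request.
        return {"start": start_date, "end": end_date, "relays": 7000}

    def relays_naive(start_date, end_date):
        # Recomputed on every request: fine in testing, painful with
        # 10 or 100 people looking at a graph at the same time.
        return aggregate(start_date, end_date)

    @lru_cache(maxsize=128)
    def relays_cached(start_date, end_date):
        # Each (start_date, end_date) pair is computed only once, but
        # one "harmless" new free-form parameter quietly defeats the
        # cache.
        return aggregate(start_date, end_date)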
3 External contributions
Most of the discussion in round 1 circled around how external contributions are great but need to be handled with care. I understand the risks here, and I think I'll postpone the part where we're including externally developed websites after "redressing" them. Let's instead try to either keep those contributions as external links or properly integrate them into Tor Metrics.
The lowest-hanging fruit here is: #8 keep adding links to external websites as we already do, assuming that we clearly mark these contributions as external.
Another low-hanging fruit is: #9 add static data or static graphs. I heard a single +1, but I guess once we have concrete suggestions for what data or graphs to add, we'll have more discussion. That's fine.
One important and still somewhat low-hanging fruit is: #10 give external developers more support when developing visualizations that could later be added to Metrics. This requires better documentation, but it also requires making it easier to install Tor Metrics locally and test new additions before submitting them. The latter is a good goal, but we're not there yet. The documentation part doesn't seem crazy though. David, if you don't mind being the guinea pig once more, I'd want to try this out with your latest visualizations. This depends on the JavaScript decision, though.
Now to the higher-hanging fruit: #11 build or adapt a tool like Munin to prototype new visualizations quickly. I didn't fully understand how Munin makes it easier to prototype a visualization than just writing a Python/Stem script for the data pre-processing and using any graphing engine for the visualization. But maybe there's something to learn here. Still, this seems like a rather distant goal at the moment.
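For reference, this is the kind of quick Python/Stem prototyping script I have in mind for the data pre-processing: fetch the current consensus, count relays per flag, and hand the result to whatever graphing engine one prefers. Just a sketch, not something that would live in the backend.

    # Prototyping sketch using Stem: fetch the current consensus and
    # count relays per flag; plotting is left to any graphing engine.
    import collections
    from stem.descriptor.remote import DescriptorDownloader

    downloader = DescriptorDownloader()
    flag_counts = collections.Counter()

    for router_status in downloader.get_consensus().run():
        for flag in router_status.flags:
            flag_counts[flag] += 1

    for flag, count in flag_counts.most_common():
        print("%s,%d" % (flag, count))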
Another one: #12 provide a public API like Onionoo but for Metrics data. This seems somewhat out of reach for the moment. It's a cool idea, but doing it right is not at all trivial. I'd want to put this to the lower end of the list.
Another high-hanging fruit: #13 adapt the Gist idea where external contributors write some code that magically turns into new visualizations. It's a neat idea, really, but I think that most new visualizations require writing backend code to aggregate different parts of the available data or to aggregate the same data differently. So I think it's a long way until we're there.
4 Summary
Here's the list of low-hanging fruit:
#1 decide whether we can still ignore JavaScript
#2 website redesign
#5 document backend programs
#6 use Big Data to produce pre-aggregated data
#8 keep adding links to external websites
#9 add static data or static graphs
#10 give external developers more support
And here are the tasks that I think we should postpone a bit longer:
#3 replace the website framework
#4 build something like Thomas' Visionion
#7 have the backend provide an API to the frontend
#11 build or adapt a tool like Munin
#12 provide a public API like Onionoo but for Metrics data
#13 adapt the Gist idea
How does this sound? What did I miss (sorry!), on a high level without going into all the details just yet? Who wants to help?
All the best,
Karsten
[0] https://lists.torproject.org/pipermail/tor-dev/2015-November/009983.html