Hi everyone,
I posted some thoughts on Scaling Tor Metrics [0] almost two weeks ago and received very useful feedback from George, David, Thomas, and Letty. Thanks for that! In fact, this is such a big topic and there was so much feedback that I decided not to respond to every idea individually but instead to start over and suggest a new plan that incorporates all the feedback I saw. Otherwise, I'd be worried that we'd lose ourselves in the details and miss the big picture. Maybe this also makes the topic somewhat easier to follow and respond to, which I hope many people will do.
The problem to solve is still that the Tor Metrics website has a huge product backlog and that we want it to remain "the primary place to learn interesting facts about the Tor network", either by making it better and bigger internally, or by adding more ways to let others contribute to it externally.
From the feedback in round 1, I observe three major areas where we need to improve Tor Metrics:
1 Metrics frontend
2 Metrics backend
3 External contributions
There are low-hanging fruit in each area, but there are many fruit overall and some are hanging higher than we might think. I'll go through them area by area and assign numbers to the tasks.
1 Metrics frontend
The frontend is the part of the Tor Metrics website that takes pre-aggregated data in .csv format as input and produces somewhat interactive visualizations. The current frontend uses Java servlets/JSP as its web framework and R/ggplot2 for graphs. This code is hard to extend, even for me, and the result isn't pretty; on the upside, the website doesn't require any JavaScript.
So, one task would be: #1 decide whether we can still ignore JavaScript and what it has to offer. I agree that D3.js is cool; I even used it myself in the past, though I know very little about it. This decision would mean that we develop new visualizations in D3.js and phase out the existing R/ggplot2 visualizations one by one. This is a tough decision, but one with a lot of potential. I understand why we're excited about this as developers, but I'd want to ask Metrics users about this first.
Another task would be: #2 website redesign. In fact, what you see on the website right now is a redesign halfway through. Believe me, the website was even less readable a year or two ago, and this statement alone tells you how slow things are moving. The remaining step is just to replace the start page with a gallery, well, and to apply a Bootstrap design to everything, because why not. One challenge here is that the current graphs all look quite similar, making them hard to distinguish in a gallery, but that's probably still more useful than putting text there. It sounds like Letty and Thomas might be willing to help out with this, which would be great.
Yet another task would be: #3 replace the website framework with something more recent. This can be something simple, as long as it supports some basic filtering and maybe searching on the start page. I'd say let's pick something Python-based here. However, maybe we should first replace the existing graphs that are deeply tied into the current website framework. If we switch to D3.js and have replaced all existing graphs, this switch to a new website framework will hurt a lot less.
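To make this slightly more concrete, here's a rough sketch of what a Python-based start page with basic filtering could look like. Flask is just one candidate, and the routes and gallery metadata below are invented for illustration, not a proposal:

    from flask import Flask, render_template_string, request

    app = Flask(__name__)

    # Hypothetical gallery metadata; the real site would generate this
    # from its list of graphs.
    GRAPHS = [
        {"id": "userstats", "title": "Direct users by country"},
        {"id": "bandwidth", "title": "Total relay bandwidth"},
    ]

    PAGE = """<ul>{% for g in graphs %}
    <li><a href="/graph/{{ g.id }}">{{ g.title }}</a></li>
    {% endfor %}</ul>"""

    @app.route("/")
    def start_page():
        # Basic filtering/searching: /?q=users shows matching graphs only.
        q = request.args.get("q", "").lower()
        return render_template_string(
            PAGE, graphs=[g for g in GRAPHS if q in g["title"].lower()])

    if __name__ == "__main__":
        app.run()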
Another high-hanging fruit would be: #4 build something like Thomas' Visionion, where users can create generic visualizations on the fly. This is an exciting idea, really, but I think we have to accept that it's out of reach for now.
2 Metrics backend
The backend of Tor Metrics consists of a bunch of programs that run once per day to fetch new data from CollecTor and produce a set of .csv files for the frontend. There are no strict requirements on languages or databases, as long as the tools are available in Debian stable. Some programs use PostgreSQL, but most of them just use files. Ironically, it's the database-based tools that have major performance problems, whereas the file-based ones work just fine. Most programs are written in Java, very few in Python.
One rather low-hanging fruit would be: #5 document backend programs and say what's required to add one more to the bunch. The biggest challenge in writing such a program is that it needs to stay reasonably fast even over the next couple of years and even if the network doubles or triples in size. I started documenting this a while ago but got distracted by other things. I wouldn't mind help.
Another rather low-hanging fruit would be: #6 use Big Data to produce pre-aggregated data for the frontend. As said before, it doesn't matter whether a backend program uses files or PostgreSQL or another database engine. What matters is that it reads data from CollecTor and produces data that the frontend can use; this could be a .csv or JSON file. We should probably pick the next visualization project as a test case for applying Big Data tools, or we could rewrite one that needs to be rewritten for performance reasons anyway.
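To illustrate what such a backend step boils down to, here's a sketch in Python using Stem, with a made-up statistic (relay and exit counts per consensus); the real aggregation would of course depend on the visualization:

    import csv

    from stem.descriptor import parse_file

    def aggregate_consensus(consensus_path, date, csv_path):
        # Reduce one consensus fetched from CollecTor to a single
        # pre-aggregated row: date, total relays, exit relays.
        relays = exits = 0
        for entry in parse_file(consensus_path,
                descriptor_type='network-status-consensus-3 1.0'):
            relays += 1
            if 'Exit' in entry.flags:
                exits += 1
        with open(csv_path, 'a') as out:
            csv.writer(out).writerow([date, relays, exits])

    # Example call (file name made up):
    # aggregate_consensus('2015-12-07-12-00-00-consensus',
    #                     '2015-12-07', 'relays.csv')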
Here's a not-so-low-hanging fruit for the backend: #7 have the backend provide an API to the frontend. This is potentially more difficult. The part that I'm very much afraid of is performance. It's just too easy to build a tool that performs reasonably during testing but that can't handle the load of 10 or 100 people looking at a frontend visualization at once. In particular, it's easy to build an API that works just fine and then add another feature that looks harmless but later turns out to hurt performance a lot. I'd say postpone.
3 External contributions
Most of the discussion in round 1 circled around how external contributions are great, but that they need to be handled with care. I understand the risks here, and I think I'll postpone the part where we're including externally developed websites after "redressing" them. Let's instead try to either keep those contributions as external links or properly integrate them into Tor Metrics.
The lowest-hanging fruit here is: #8 keep adding links to external websites as we already do, assuming that we clearly mark these contributions as external.
Another low-hanging fruit is: #9 add static data or static graphs. I heard a single +1, but I guess once we have concrete suggestions what data or graphs to add, we'll have more discussion. That's fine.
One important and still somewhat low-hanging fruit is: #10 give external developers more support when developing visualizations that could later be added to Metrics. This requires better documentation, but it also requires making it easier to install Tor Metrics locally and test new additions before submitting them. The latter is a good goal, but we're not there yet. The documentation part doesn't seem crazy though. David, if you don't mind being the guinea pig once more, I'd want to try this out with your latest visualizations. This is pending the JavaScript decision, though.
Now to the higher-hanging fruit: #11 build or adapt a tool like Munin to prototype new visualizations quickly. I didn't fully understand how Munin makes it easier to prototype a visualization than just writing a Python/Stem script for the data pre-processing and using any graphing engine for the visualization. But maybe there's something to learn here. Still, this seems like quite a high-hanging goal at the moment.
Another one: #12 provide a public API like Onionoo but for Metrics data. This seems somewhat out of reach for the moment. It's a cool idea, but doing it right is not at all trivial. I'd want to put this to the lower end of the list.
Another high-hanging fruit: #13 adapt the Gist idea, where external contributors write some code that magically turns into new visualizations. It's a neat idea, but I think most new visualizations require writing backend code to aggregate different parts of the available data, or to aggregate the same data differently. I think it's a long way until we're there.
4 Summary
Here's the list of low-hanging fruit:
#1 decide whether we can still ignore JavaScript
#2 website redesign
#5 document backend programs
#6 use Big Data to produce pre-aggregated data
#8 keep adding links to external websites
#9 add static data or static graphs
#10 give external developers more support
And here are the tasks that I think we should postpone a bit longer:
#3 replace the website framework
#4 build something like Thomas' Visionion
#7 have the backend provide an API to the frontend
#11 build or adapt a tool like Munin
#12 provide a public API like Onionoo but for Metrics data
#13 adapt the Gist idea
How does this sound? What did I miss (sorry!), on a high level without going into all the details just yet? Who wants to help?
All the best, Karsten
[0] https://lists.torproject.org/pipermail/tor-dev/2015-November/009983.html
On 7 Dec 2015, at 02:52, Karsten Loesing karsten@torproject.org wrote:
Hi everyone,
I posted some thoughts on Scaling Tor Metrics [0] almost two weeks ago and received very useful feedback from George, David, Thomas, and Letty. Thanks for that! ...
...
So, one task would be: #1 decide whether we can still ignore JavaScript and what it has to offer. I agree that D3.js is cool; I even used it myself in the past, though I know very little about it. This decision would mean that we develop new visualizations in D3.js and phase out the existing R/ggplot2 visualizations one by one. This is a tough decision, but one with a lot of potential. I understand why we're excited about this as developers, but I'd want to ask Metrics users about this first.
I run Tor Browser in high security mode by default. That disables JavaScript on all sites. I like being able to browse metrics without turning JavaScript on (except for the bubble graphs[0]).
But we already require users to turn on JavaScript for the bubble graphs, globe, etc.
So it's not an unreasonable decision to require it.
... One important and still somewhat low-hanging fruit is: #10 give external developers more support when developing visualizations that could later be added to Metrics. This requires better documentation, but it also requires making it easier to install Tor Metrics locally and test new additions before submitting them. The latter is a good goal, but we're not there yet. The documentation part doesn't seem crazy though. David, if you don't mind being the guinea pig once more, I'd want to try this out with your latest visualizations. This is pending the JavaScript decision, though.
Do David's visualisations already use JavaScript? We could always do what we did with the bubble graphs, and make (another) part of the metrics site use JavaScript.
Or are we waiting to choose a language before doing any new work?
Tim
[0]: https://metrics.torproject.org/bubbles.html
Tim Wilson-Brown (teor)
teor2345 at gmail dot com PGP 968F094B
teor at blah dot im OTR CAD08081 9755866D 89E2A06F E3558B7F B5A9D14F
Hi,
teor: Do David's visualizations already use JavaScript? We could make (another) part of the metrics site use JavaScript.
Can the data be processed on the host server and sent to the client JS-free?
Or are we waiting to choose a language before doing any new work?
What are the options?
Wordlife, Spencer
On 07/12/15 01:07, Spencer wrote:
Hi,
teor: Do David's visualizations already use JavaScript? We could make (another) part of the metrics site use JavaScript.
Can the data be processed on the host server and sent to the client JS-free?
We briefly discussed making a JavaScript-free Globe a while ago by using Node.js. I'm not sure whether this would also work for Metrics. It may depend on how interactive graphs are supposed to be.
But before we look more into this: do we really have no JavaScript at all? The High Security level in Tor Browser says that JavaScript performance optimizations are disabled and that JavaScript is disabled on all non-HTTPS sites, but Metrics runs on HTTPS, so in theory the bubble graphs should work in Tor Browser. A possible deal breaker is that we're fetching data from Onionoo for the bubble graphs. Would we be able to put JavaScript on Metrics if we load data from https://metrics.torproject.org/ instead of https://onionoo.torproject.org/?
Or are we waiting to choose a language before doing any new work?
What are the options?
I think the main option is to keep rendering graphs on the server. Right now, we're using R/ggplot2 for that, but we could switch to server-side JavaScript or really anything else. The main downside is lack of real interactivity.
Thanks for the feedback!
All the best, Karsten
On 7 Dec 2015, at 19:14, Karsten Loesing karsten@torproject.org wrote:
On 07/12/15 01:07, Spencer wrote:
Hi,
teor: Do David's visualizations already use JavaScript? We could make (another) part of the metrics site use JavaScript.
Can the data be processed on the host server and sent to the client JS-free?
We briefly discussed making a JavaScript-free Globe a while ago by using Node.js. I'm not sure whether this would also work for Metrics. It may depend on how interactive graphs are supposed to be.
There are privacy advantages to doing the Globe processing on the client using JavaScript. It's a design that means that user queries are never seen by the server.
But before we look more into this: do we really have no JavaScript at all? The High Security level in Tor Browser says that JavaScript performance optimizations are disabled and that JavaScript is disabled on all non-HTTPS sites, but Metrics runs on HTTPS, so in theory the bubble graphs should work in Tor Browser.
The Medium-High level disables JavaScript on non-HTTPS sites. The High level disables JavaScript on all sites. (In either case, users can enable it on a site-by-site basis.)
Tim
On 07/12/15 12:10, Tim Wilson-Brown - teor wrote:
On 7 Dec 2015, at 19:14, Karsten Loesing karsten@torproject.org wrote:
On 07/12/15 01:07, Spencer wrote:
Hi,
teor: Do David's visualizations already use JavaScript? We could make (another) part of the metrics site use JavaScript.
Can the data be processed on the host server and sent to the client JS-free?
We briefly discussed making a JavaScript-free Globe a while ago by using Node.js. I'm not sure whether this would also work for Metrics. It may depend on how interactive graphs are supposed to be.
There are privacy advantages to doing the Globe processing on the client using JavaScript. It's a design that means that user queries are never seen by the server.
Well, queries are still seen by the Onionoo server.
But before we look more into this: do we really have no JavaScript at all? The High Security level in Tor Browser says that JavaScript performance optimizations are disabled and that JavaScript is disabled on all non-HTTPS sites, but Metrics runs on HTTPS, so in theory the bubble graphs should work in Tor Browser.
The Medium-High level disables JavaScript on non-HTTPS sites. The High level disables JavaScript on all sites. (In either case, users can enable it on a site-by-site basis.)
You're right. Thanks for clarifying this.
All the best, Karsten
Hi,
Karsten Loesing: We briefly discussed making a JavaScript-free Globe a while ago by using Node.js. I'm not sure whether this would also work for Metrics. It may depend on how interactive graphs are supposed to be.
As said later in this thread, .png seems okay. Though I see the load on the server if tons of peeps get at the site; I respect the client-side preference.
Thanks :)
I think the main option is to keep rendering graphs on the server. Right now, we're using R/ggplot2 for that, but we could switch to server-side JavaScript or really anything else. The main downside is lack of real interactivity.
I see the need for interaction :) David McCandless [0] has some cool stuff that isn't very interactive (but uses JS).
Can the data be processed offline by each person? Tor Rendering Engine :P
Wordlife, Spencer
On 09/12/15 04:01, Spencer wrote:
Hi,
Karsten Loesing: We briefly discussed making a JavaScript-free Globe a while ago by using Node.js. I'm not sure whether this would also work for Metrics. It may depend on how interactive graphs are supposed to be.
As said later in this thread, .png seems okay. Though I see the load on the server if tons of peeps get at the site; I respect the client-side preference.
I'm not really worried about server load. At least this hasn't been an issue with the current Metrics website.
Thanks :)
I think the main option is to keep rendering graphs on the server. Right now, we're using R/ggplot2 for that, but we could switch to server-side JavaScript or really anything else. The main downside is lack of real interactivity.
I see the need for interaction :) David McCandless [0] has some cool stuff that isn't very interactive (but uses JS).
There are indeed great visualizations out there, and interactivity isn't everything. But having to go back to the server for each change, including picking a different time period to be displayed in a graph, is really uncool.
If somebody here knows a solution to this problem, that is, generating graphs on the server while still making the result as interactive as possible on the client, I'd love to hear suggestions.
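To make the trade-off concrete rather than solve it, here's a rough sketch of parameter-driven server-side rendering: a plain HTML form submits a new time range, and the server renders a fresh PNG. Flask and matplotlib merely stand in for our actual Java/R stack, and the route, parameters, and load_rows() helper are all invented:

    import io

    import matplotlib
    matplotlib.use('Agg')  # render headlessly on the server
    import matplotlib.pyplot as plt
    from flask import Flask, request, send_file

    app = Flask(__name__)

    @app.route("/userstats.png")
    def userstats():
        # Every change of time period costs a full round-trip.
        start = request.args.get("start", "2015-01-01")
        end = request.args.get("end", "2015-12-31")
        dates, values = load_rows(start, end)
        fig, ax = plt.subplots()
        ax.plot(dates, values)
        ax.set_title("Hypothetical user numbers, %s to %s" % (start, end))
        buf = io.BytesIO()
        fig.savefig(buf, format="png")
        plt.close(fig)
        buf.seek(0)
        return send_file(buf, mimetype="image/png")

    def load_rows(start, end):
        # Stand-in for reading the pre-aggregated .csv file.
        return [1, 2, 3], [10, 20, 15]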
Can the data be processed offline by each person? Tor Rendering Engine :P
Not if we want to build tools for more than a handful of people. :)
Thanks for the feedback!
All the best, Karsten
Wordlife, Spencer
Hi,
Karsten Loesing: I'm not really worried about server load. At least this hasn't been an issue with the current Metrics website.
Word.
There are indeed great visualizations out there
If you know of anyone dedicating their life to dataviz, please point, as it seems David McCandless is the only one going this hard, and I am always on the look out for some fresh illness.
having to go back to the server for each change ... is really uncool.
I agree. However, given the challenges touched on so far, it seems that a reasonable resolution may be to select the data points and jump offline (or into some sandbox) and go to town on all kinds of combos.
Not if we want to build tools for more than a handful of people. :)
It is unclear what you mean. Most devices can process the data, I presume, and usability is okay with using downloads in 'Work Offline' mode. I am not sure what issue you are referring to :(
Though I am all for the one-stop-shop, it seems that a feed of all the updated data upon request, to be processed at will, might be a nice way to do this :)
Food for thought.
Wordlife, Spencer
On 09/12/15 18:18, Spencer wrote:
Hi,
Hello Spencer,
Karsten Loesing: I'm not really worried about server load. At least this hasn't been an issue with the current Metrics website.
Word.
There are indeed great visualizations out there
If you know of anyone dedicating their life to dataviz, please point, as it seems David McCandless is the only one going this hard, and I am always on the look out for some fresh illness.
Ah, I wasn't referring to anyone specifically. I'm also a fan of David McCandless' work and have his book on the shelf here. :) (Next to the wonderful books of Edward Tufte and the great ggplot2 book of Hadley Wickham.)
having to go back to the server for each change ... is really uncool.
I agree. However, given the challenges touched on so far, it seems that a reasonable resolution may be to select the data points and jump offline (or into some sandbox) and go to town on all kinds of combos.
Can you elaborate on that?
Not if we want to build tools for more than a handful of people. :)
It is unclear what you mean. Most devices can process the data, I presume, and usability is okay with using downloads in 'Work Offline' mode. I am not sure what issue you are referring to :(
Though I am all for the one-stop-shop, it seems that a feed of all the updated data upon request, to be processed at will, might be a nice way to do this :)
Food for thought.
It might well be that we're talking about different things. When you said offline I was thinking of somebody downloading a tool, like a Python script, to process data outside of the browser. That's what I meant by building tools for a handful of people. I think if it requires more than the browser, hardly anybody will use it.
But it seems you're talking about something different, right? Curious to learn more.
All the best, Karsten
Wordlife, Spencer
Hi,
Karsten Loesing: Ah, I wasn't referring to anyone specifically. I'm also a fan of David McCandless' work and have his book on the shelf here :)
There are two; a new one as of last year.
(Next to the wonderful books of Edward Tufte and the great ggplot2 book of Hadley Wickham.)
These professors have a dry, but excitingly technical, approach to their work; I dig it. Thanks! I will definitely be playing with Wickham's stuff :)
Spencer: select the data points and jump offline
Can you elaborate on that?
It might well be that we're talking about different things. When you said offline I was thinking of somebody downloading a tool, like a Python script, to process data outside of the browser. That's what I meant by building tools for a handful of people. I think if it requires more than the browser, hardly anybody will use it.
But it seems you're talking about something different, right? Curious to learn more.
Maybe. Downloading the dependencies can cause many usability and security issues, so I agree with the example .py context you provided.
If the data can be selected, individually or all, and cached for offline use, it seems that an included .css and .js could style and render everything on the fly.
Quick searches show ngraph[0], appcache[1], and Google[2] have some related things.
Wordlife, Spencer
[0]: https://github.com/anvaka/ngraph [1]: http://sitepoint.com/creating-offline-html5-apps-with-appcache/ [2]: https://developers.google.com/chart/interactive/faq
On 06 Dec (16:52:45), Karsten Loesing wrote:
Hi everyone,
[snip]
One important and still somewhat low-hanging fruit is: #10 give external developers more support when developing visualizations that could later be added to Metrics. This requires better documentation, but it also requires making it easier to install Tor Metrics locally and test new additions before submitting them. The latter is a good goal, but we're not there yet. The documentation part doesn't seem crazy though. David, if you don't mind being the guinea pig once more, I'd want to try this out with your latest visualizations. This is pending the JavaScript decision, though.
The current visualizations I have are all generated by a Munin server, which collects data points on the "munin node" every 5 minutes and generates graphs (PNG). So as a client accessing the server, you only have to fetch a PNG; all the CPU work for the graph is done on the server side.
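For readers who haven't seen Munin plugins: a plugin is a tiny executable that Munin calls with "config" once to learn about the graph, and without arguments at every collection interval to fetch a value; the server does all the polling and PNG rendering. A generic example in Python (not one of my actual plugins, just an illustration):

    #!/usr/bin/env python
    import sys

    def current_value():
        # Whatever the plugin measures would be computed here;
        # the constant is a placeholder.
        return 42

    if len(sys.argv) > 1 and sys.argv[1] == 'config':
        print('graph_title Example measurement')
        print('example.label measured value')
    else:
        print('example.value %d' % current_value())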
It's indeed the JS vs non JS discussion where you basically want to put the load on the client side instead of the server.
Please expand on what would be required of me for this guinea pig experiment? :)
Thanks! David
On 07/12/15 17:31, David Goulet wrote:
On 06 Dec (16:52:45), Karsten Loesing wrote:
Hi everyone,
[snip]
One important and still somewhat low-hanging fruit is: #10 give external developers more support when developing visualizations that could later be added to Metrics. This requires better documentation, but it also requires making it easier to install Tor Metrics locally and test new additions before submitting them. The latter is a good goal, but we're not there yet. The documentation part doesn't seem crazy though. David, if you don't mind being the guinea pig once more, I'd want to try this out with your latest visualizations. This is pending the JavaScript decision, though.
The current visualizations I have are all generated by a Munin server, which collects data points on the "munin node" every 5 minutes and generates graphs (PNG). So as a client accessing the server, you only have to fetch a PNG; all the CPU work for the graph is done on the server side.
It's indeed the JS vs non JS discussion where you basically want to put the load on the client side instead of the server.
Please expand on what would be required of me for this guinea pig experiment? :)
Hi David,
the following description only applies if you want your visualizations to be part of Metrics. We could also start by adding them as "external" visualizations by linking to your server. But let me expand on the scenario where you want them to be part of Metrics.
The Munin model that you describe sounds very simple, but it lacks an important property: users cannot modify graphs other than picking different time periods. The graphs on Metrics don't have this limitation, but of course that doesn't come for free.
All graphs on Metrics consist of two parts: the first part aggregates data every 24 hours, producing data sufficient for any graph you'd want users to be able to create, and the second part draws new graphs based on user input.
So, we'll have to split up your code into one part that produces a .csv file (or related format) and another part that draws graphs.
There are very few requirements for writing the first part of the code. I'm calling that code a data-aggregating module in Metrics. Maybe look at a module that I wrote quite recently:
https://gitweb.torproject.org/metrics-web.git/tree/modules/connbidirect
Note that this code could also be written in Python using Stem for descriptor parsing. As long as cron can call it on the command line, all is good. Of course, it shouldn't require endless amounts of memory, and it should ideally be done within minutes, but we could talk about those requirements. Another requirement is that it's usable enough that somebody who didn't write the code can run it and fix the most trivial problems.
Ideally, you'd be able to re-use 90% of your current code for this first part. The only difference is that you wouldn't have Munin produce a graph; instead, you'd output a .csv file with what goes into Munin's graphs.
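To sketch the shape of such a module, assuming the Python/Stem route: the paths, CSV columns, and placeholder statistic below are all invented, and the only real contract is that cron can run it from the command line and a .csv file comes out.

    #!/usr/bin/env python
    import argparse
    import csv
    import os

    from stem.descriptor import parse_file

    def summarize(path):
        # Placeholder statistic: count the relays in one consensus.
        return sum(1 for _ in parse_file(
            path, descriptor_type='network-status-consensus-3 1.0'))

    def main():
        parser = argparse.ArgumentParser(
            description='Aggregate CollecTor data into a .csv for Metrics.')
        parser.add_argument('--in', dest='in_dir', default='in/')
        parser.add_argument('--out', dest='out_csv',
                            default='stats/example.csv')
        args = parser.parse_args()
        with open(args.out_csv, 'w') as out:
            writer = csv.writer(out)
            writer.writerow(['consensus', 'relays'])
            for name in sorted(os.listdir(args.in_dir)):
                writer.writerow(
                    [name, summarize(os.path.join(args.in_dir, name))])

    if __name__ == '__main__':
        main()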
There's less flexibility in the second part of the code, the part that generates graphs. If we want to use the current graphing engine, we'll have to write some R/ggplot2 code and extend the Java servlets/JSPs. This is not crazy talk, but it's probably going to require half a day or more of a Metrics person's time.
If we had already decided to switch to JavaScript, that would be different. In that case you could write the graphing code in D3.js, test it locally, and once you like it, we'd copy it over to Metrics. But we're not there yet, nor do I know how fast we're moving forward there. That's why I'd suggest going with R/ggplot2. That code can probably be written quite quickly.
What do you think? Want to give this a try, maybe starting with your favorite visualization?
All the best, Karsten