Erik and I have finished the proc.py integration tests
Great! I'm looking forward to checking them out.
Tomorrow, we will be starting the next Stem task regarding the Tor Export project. Do you have any recommendations on where we should start exploring in the Stem documentation?
Hmm, this concerns translating Descriptor subclasses [1] into a csv (and I suppose loading them back up). First step would be to google around to see if there's a builtin function that'll do this or help with it. If not, then this might simply involve dumping out their __dict__...
>>> class Foo:
...   def __init__(self):
...     self.foo = 5
...     self.bar = "hello"
...   def set_bar(self, value):
...     self.bar = value
...
>>> f = Foo()
>>> f.__dict__
{'foo': 5, 'bar': 'hello'}
>>> ",".join([str(v) for v in f.__dict__.values()])
'5,hello'
You probably want to sort the keys and use that order, but besides that this should do the trick. Just add a unit test and you're done. :)
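In other words, something along these lines using the stdlib csv module (dump_csv is just a throwaway name for illustration, and this assumes each attribute can be meaningfully turned into a string)...

import csv

def dump_csv(descriptors, csv_file):
  # sort the attribute names so every row uses the same column order
  fields = sorted(descriptors[0].__dict__.keys())

  writer = csv.writer(csv_file)
  writer.writerow(fields)  # header row

  for desc in descriptors:
    writer.writerow([str(desc.__dict__[field]) for field in fields])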
If we want to convert csvs *back* into Descriptors then that's a bit more work.
Also, what do you anticipate being necessary when it comes to parsing consensus entries?
A NetworkStatusEntry class (and its tests) would be similar to the ServerDescriptor [2] and ExtraInfoDescriptor [3], but a fair bit easier (it only has six entries). To implement this you'd...
1. Read over the ServerDescriptor and ExtraInfoDescriptor to figure out how we're tackling the parsing and verification in python.

2. Read over Karsten's NetworkStatusEntryImpl class [4] to see how he's handling this in metrics-lib.

3. Read the spec for these entries [5]. Do we need both a NetworkStatusEntryV2 and NetworkStatusEntryV3 class? Exactly how should we model this? Karsten: thoughts?

4. Go to 'https://metrics.torproject.org/data.html' to get some test data, so you can see what the things you'll be parsing look like.

5. Write a NetworkStatusEntry class with the goal of "figure out each and every way that data might be malformed". A large part of this class is to verify that the data we're being given is perfectly valid according to the spec, so read the spec and implement exactly what it says. If there's some ambiguity in the spec, or you see data that doesn't conform to it, then let us know. That's a tor bug. (There's a rough sketch of what I mean below.)

6. Write unit and integ tests similar to what the other descriptors have...

   - unit tests are the majority of the work, and exercise all the use cases that you can think of against mock objects

   - integ tests are pretty short, and just run the parser against some test data from the metrics archive and the cached consensus
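To make step 5 a bit more concrete, here's the rough shape I have in mind. The class, method, and attribute names below are just placeholders rather than a final design, and it only handles a sliver of the fields...

class NetworkStatusEntry(object):
  def __init__(self, raw_content, validate = True):
    self.nickname = None
    self.fingerprint = None
    self.published = None
    # ... attributes for the remaining 'r', 's', 'v', etc line values ...

    self._parse(raw_content, validate)

  def _parse(self, raw_content, validate):
    for line in raw_content.splitlines():
      keyword = line.split(' ', 1)[0]

      if keyword == 'r':
        self._parse_r_line(line, validate)
      # ... dispatch for the other keywords ...
      elif validate:
        # when validating, anything the spec doesn't allow is an error
        raise ValueError("Unrecognized line in a status entry: %s" % line)

  def _parse_r_line(self, line, validate):
    values = line.split(' ')

    # v3 'r' lines have eight values after the keyword (see the dir-spec)
    if validate and len(values) != 9:
      raise ValueError("An 'r' line should have eight values: %s" % line)

    self.nickname = values[1]
    # ... and so on for the other 'r' line fields ...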
Let me know if you have any questions! -Damian
PS. Karsten: Do we want to call this "NetworkStatusEntry" or something else? It seems like "ConsensusEntry" would be more intuitive, but maybe this would just spawn confusion.
[1] https://gitweb.torproject.org/stem.git/blob/HEAD:/stem/descriptor/__init__.p...
[2] https://gitweb.torproject.org/stem.git/blob/HEAD:/stem/descriptor/server_des...
[3] https://gitweb.torproject.org/stem.git/blob/HEAD:/stem/descriptor/extrainfo_...
[4] https://gitweb.torproject.org/metrics-lib.git/blob/HEAD:/src/org/torproject/...
[5] https://gitweb.torproject.org/torspec.git/blob/HEAD:/dir-spec.txt#l1334
On 6/28/12 11:37 PM, Damian Johnson wrote:
- integ tests are pretty short, and just run the parser against some
test data from the metrics archive and the cached consensus
Keep in mind that metrics tarballs can be huge. stem's tests probably shouldn't download one or more of these tarballs in an automatic integ test run.
PS. Karsten: Do we want to call this "NetworkStatusEntry" or something else? It seems like "ConsensusEntry" would be more intuitive, but maybe this would just spawn confusion.
A consensus is just one type of a network status. Other types are:

- votes,
- opinions (specified in proposal 147 and targeted for 0.2.4.x),
- microdescriptor consensuses,
- sanitized bridge network statuses,
- v2 network statuses, and
- v1 directories (which are quite different though).
So, for the Java metrics-lib I went with a single NetworkStatusEntry class for everything except v1 directory entries, but that won't scale forever. For example, "r" lines in consensuses are different from "r" lines in microdescriptor consensuses. The Java metrics-lib doesn't understand microdescriptor consensuses, because they don't contain anything new for statistical analysis, but I think stem will want to parse them. It probably makes sense to have an abstract NetworkStatusEntry class that does most of the parsing work but that can be specialized in its subclasses. Picking names like ConsensusEntry if the consensus class is called Consensus makes sense. If there's a similar concept to Java's inner classes in Python, maybe using something like Consensus.Entry might be a good choice, too, because this class will only be used as part of a Consensus.
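As a Python sketch of that idea (the names here are made up, and it only shows the specialization part):

class NetworkStatusEntry(object):
  def _parse_r_line(self, line):
    # parsing of the fields that every network status flavor shares
    pass

class ConsensusEntry(NetworkStatusEntry):
  pass  # inherits the common "r" line handling unchanged

class MicrodescConsensusEntry(NetworkStatusEntry):
  def _parse_r_line(self, line):
    # "r" lines in microdescriptor consensuses differ (no descriptor
    # digest), so only this piece needs to be overridden
    pass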
Best, Karsten
2012/6/28 Damian Johnson atagar@torproject.org:
Erik and I have finished the proc.py integration tests
Great! I'm looking forward to checking them out.
Tomorrow, we will be starting the next Stem task regarding the Tor Export project. Do you have any recommendations on where we should start exploring in the Stem documentation?
Hmm, this concerns translating Descriptor subclasses [1] into a csv (and I suppose loading them back up). First step would be to google around to see if there's a builtin function that'll do this or help with it. If not, then this might simply involve dumping out their __dict__...
Please use the built-in function vars() instead of __dict__ to retrieve instance attributes.
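For instance, with the Foo class from earlier:

>>> f = Foo()
>>> vars(f) == f.__dict__
True
>>> vars(f)
{'foo': 5, 'bar': 'hello'}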
Keep in mind that metrics tarballs can be huge. stem's tests probably shouldn't download one or more of these tarballs in an automatic integ test run.
Oops yup. Should have mentioned that. We're just picking out a descriptor that seems to exercise most of the parsing. This is just for a sanity check that 'we can still parse something found in the wild'. Megan, Erik: the layout should be pretty obvious when you take a peek in test/integ/descriptor/data/*.
The Java metrics-lib doesn't understand microdescriptor consensuses, because they don't contain anything new for statistical analysis, but I think stem will want to parse them.
Definitely. Microdescriptors are available via the control protocol so we need to be able to parse them.
It probably makes sense to have an abstract NetworkStatusEntry class that does most of the parsing work but that can be specialized in its subclasses. Picking names like ConsensusEntry if the consensus class is called Consensus makes sense.
Perfect, thanks. Megan, Erik: if I were in your shoes, the first thing that I'd do to approach this is propose the following on this list...

- an object hierarchy (we already have a bit of one, ex. ServerDescriptor vs RelayDescriptor/BridgeDescriptor)

- a description for each of the classes, preferably something meaty that we can use for the pydocs of each class with the :var: entries

- your thoughts on which parsing logic should go where (look at the previous descriptor classes for a pattern that you might want to follow)
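Even bare stubs would be plenty for that proposal, for instance something like the following (the classes and :var: entries here are just to show the format, not a suggested design)...

class NetworkStatusDocument(object):
  """
  Parsed network status document.

  :var datetime valid_after: time when this document became valid
  :var list routers: NetworkStatusEntry instances for the contained relays
  """

class NetworkStatusEntry(object):
  """
  Relay entry within a network status document.

  :var str nickname: relay's nickname
  :var str fingerprint: relay's fingerprint
  """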
If there's a similar concept to Java's inner classes in Python, maybe using something like Consensus.Entry might be a good choice, too, because this class will only be used as part of a Consensus.
Yup, there is.
>>> class Foo:
...   class Bar:
...     def __init__(self):
...       self.my_value = 5
...   def __init__(self):
...     self.my_bar = Foo.Bar()
...
>>> f = Foo()
>>> f.my_bar.my_value
5
A related question: can you give us a couple of use-cases for the export functionality? E.g., is filtering (we only want fields X, Y, and Z when Q = ...) likely to be of use? Anything beyond just a straight dump of descriptor/network status/etc entries?
I'll mostly leave this question for Fabio since the csv dumping functionality was his idea, though my thoughts on some use cases are...
- user writes a script that has stem parse the descriptors, filters the results (say, down to Syrian exit relays), then dumps them to a csv so they can make pretty graphs or do other analysis of the data
- user has a python script that hourly parses their cached descriptors to get any new exits that only allow plaintext traffic, then dumps just the fingerprint and IP to a csv so they can later be scanned for malicious activity
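The second of those might look something like this sketch (the 'descriptors' iterable and the accepts_port attribute are stand-ins for whatever API we end up providing, not actual stem functions)...

import csv

def dump_plaintext_exits(descriptors, csv_file):
  # 'descriptors' stands in for however stem hands back parsed server
  # descriptors, and 'accepts_port' for an exit policy check - both are
  # placeholders rather than real stem attributes
  writer = csv.writer(csv_file)

  for desc in descriptors:
    if desc.accepts_port(80):
      writer.writerow([desc.fingerprint, desc.address])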
Please use the built-in function vars() instead of __dict__ to retrieve instance attributes.
Ah ha, thanks.