Other questions I'd want to investigate:
(A) Are the failures consistent, or intermittent? That is, does a failed link always fail, or only sometimes?
Yes, this is what our new testing methodology should support. My current scanner is not sufficient for answering this, and we want to improve it.
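
As a rough sketch of what question (A) asks for, assuming the improved scanner probes each link several times and records every outcome: the function name `classify_links` and the `(link, succeeded)` input shape below are hypothetical, not part of the current scanner.

```python
from collections import defaultdict

def classify_links(observations):
    """Classify each link from repeated probe results.

    observations: iterable of ((relay_a, relay_b), succeeded) tuples,
    one tuple per probe attempt (hypothetical input format).
    """
    counts = defaultdict(lambda: [0, 0])  # link -> [successes, failures]
    for link, succeeded in observations:
        counts[link][0 if succeeded else 1] += 1

    result = {}
    for link, (ok, bad) in counts.items():
        if bad == 0:
            result[link] = "always_ok"
        elif ok == 0:
            result[link] = "always_failed"
        else:
            result[link] = "intermittent"
    return result
```

A link that has only been probed once cannot be told apart from a consistently failing one, so the scanner would need several attempts per link before this classification means much.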
(B) Are you really sure that it failed? I would guess that 'failed' is different from 'timeout' because it got an explicit destroy back? If so, don't destroy cells have 'reason' components? Which reasons are happening most commonly?
Yes, I am sure it failed. It would be nice if txtorcon could expose the 'reason', but I think that it cannot. I suppose it will show up in the tor log file if I set it to debug logging.
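
One possible way to get at "which reasons are most common", independent of what txtorcon exposes: tor's control protocol attaches REASON (and, when a DESTROY or TRUNCATED cell was received, REMOTE_REASON) fields to CIRC events for failed and closed circuits. The sketch below assumes those event lines have already been captured as text by some means; the helper names are hypothetical and any sample lines fed to it would need to come from a real controller session.

```python
from collections import Counter

def parse_circ_event(line):
    """Parse one CIRC event line into a dict, or return None if it is not one."""
    parts = line.split()
    if parts and parts[0] == "650":   # drop the async-event status code if present
        parts = parts[1:]
    if len(parts) < 3 or parts[0] != "CIRC":
        return None
    fields = {"id": parts[1], "status": parts[2]}
    for token in parts[3:]:
        key, sep, value = token.partition("=")
        # Only accept KEY=value keywords; path entries such as
        # "$FINGERPRINT=nickname" start with "$" and are skipped.
        if sep and key.isupper() and key.replace("_", "").isalpha():
            fields[key] = value
    return fields

def tally_remote_reasons(lines):
    """Count REMOTE_REASON values across FAILED circuit events."""
    tally = Counter()
    for line in lines:
        event = parse_circ_event(line)
        if event and event["status"] == "FAILED":
            tally[event.get("REMOTE_REASON", "(none)")] += 1
    return tally
```

Failures with no REMOTE_REASON are counted under "(none)", which would roughly separate locally detected failures such as timeouts from explicit DESTROY responses.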
(C) We should find a link that is failing between two relays that we both control, and look at each one more closely to see if there are any hints. For example, is there anything in the logs? If we turn up the logging, do we get any hints then?
Sounds good. I would certainly be willing to collaborate with teor or anyone else who might like to help with this.
(D) ...which leads to: we should run this same tool on the test network that teor and dgoulet et al run, and look for failures there. Assuming we find some, since there are no users on the test network, we can investigate much more thoroughly.
Sounds good. Let me know if there is anything I can do to help with this.
(E) I wonder if there's a correlation between the failed links and whether a TLS connection is already established on that link. That is, when there is no connection already, there are many more steps that need to be taken to extend the circuit, and those steps could lead to increased failure rates, either due to the extra time that is needed, or because part of tor's link handshake (NETINFO, etc) is going wrong.
Ah yes, this is another good question, and one for which I do not currently have an answer.
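
If we could record, for each extend attempt, whether a connection between the two relays already existed (for example from the logs of relays we control, which is an assumption about what data we can get), then a first look at question (E) could be as simple as comparing failure rates between the two cases. The function name and input shape below are hypothetical.

```python
def failure_rate_by_connection_state(samples):
    """Compare failure rates with and without a pre-existing connection.

    samples: iterable of (had_connection, failed) boolean pairs,
    one pair per extend attempt (hypothetical input format).
    Returns {True: rate, False: rate}; a rate is None if there were no attempts.
    """
    table = {True: [0, 0], False: [0, 0]}  # had_connection -> [attempts, failures]
    for had_connection, failed in samples:
        row = table[bool(had_connection)]
        row[0] += 1
        row[1] += int(bool(failed))
    return {
        state: (failures / attempts if attempts else None)
        for state, (attempts, failures) in table.items()
    }
```

A clearly higher rate in the "no existing connection" case would point toward the link handshake or the extra setup time, though with few samples the difference could easily be noise.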