Howdy all.
I would like to make a mnemonic translation method for onion URLs. Not a full petname system with registration and so forth, just a "four little words" style way of rendering and entering existing 80-bit hashes that is memorable enough not merely to detect partial-overlap fuzzing attacks, but to let a user actually memorize and produce the URL directly.
I'm not yet ready to make a formal proposal for serious review, but here is an edit-authorized link to a draft document which has a bit more detail. Feel free to edit and comment there. https://docs.google.com/document/d/1IFMnMGh8ZqJIXYBhZOK4IW5KKG0fTatpTJiTS67n...
I'll update tor-dev when I feel it's ready for review as an actual draft proposal; it's certainly not there yet. This is just meant as a courtesy notice in case you're interested enough to help make it, and as a follow-up to a discussion earlier today on tor-assistants.
Cheers,
Sai
On Tue, Dec 20, 2011 at 11:20 PM, Sai tor@saizai.com wrote:
Howdy all.
I would like to make a mnemonic translation method for onion URLs. Not a full petname system with registration and so forth, just a "four little words" style way of rendering and entering existing 80-bit hashes that is memorable enough not merely to detect partial-overlap fuzzing attacks, but to let a user actually memorize and produce the URL directly.
So, the last time I was talking about this (with a grad student whose name I have just asked for permission to cite), I came up with the following scheme. Let BW be the number of bits per natural-language word; let BH be the number of bits per hidden service identity key hash. Let H() be some hash function. Represent the onion key identity hash as a sequence of BH/BW natural-language words, letting each word W represent the first BW bits of H(W).
So if H() is SHA256, BW is 20, and BH is 80, the hostname bookbind.electrodynamism.unsurnamed.squibbish would represent the 80-bit sequence Ht("bookbind") || Ht("electrodynamism") || Ht("unsurnamed") || Ht("squibbish"), where Ht(x) is the first 20 bits of SHA256(x).
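Here's a minimal sketch of the decoding direction in Python, assuming the example parameters above (the names are illustrative, not any real implementation):

    import hashlib

    BW = 20  # bits contributed per word
    BH = 80  # bits in the hidden service identity key hash

    def ht(word):
        # First BW bits of SHA256(word), as an integer.
        digest = hashlib.sha256(word.encode('utf-8')).digest()
        return int.from_bytes(digest, 'big') >> (len(digest) * 8 - BW)

    def hostname_to_hash(hostname):
        # Concatenate the truncated hashes of each dot-separated word.
        value = 0
        for word in hostname.split('.'):
            value = (value << BW) | ht(word)
        return value  # four 20-bit words -> one 80-bit integer

So hostname_to_hash("bookbind.electrodynamism.unsurnamed.squibbish") yields whatever 80-bit value those four truncated hashes happen to spell out.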
(In practice, 80 bits is feeling way too short these days; but that's another discussion.)
The main advantages and disadvantages of this idea are:
+ You don't need to ship a shared wordlist with the client software. (That's a good thing, since big dictionaries get annoying fast.)
+ You don't need to restrict yourself to any particular language; it is not English-centric.
- If BW is large enough, there will be sequences of bits that are not represented as the hash of any particular English word. (E.g., if BW is 32, you are in trouble: no language has 4 billion words!) You can work around this by having your search program generate new English-sounding sequences, by using a bigger wordlist, or by generating keys until you find one whose fingerprint *is* representable with words you have.
- If you make BW big, you are going to get some pretty obscure words.
- Encodings are not unique.
+ You can get a lot of bits per word. If you're willing to do stuff like "every word in English" x "the numbers 00-99", you can get even more.
+ It scales to more than 80 bits pretty easily.
- It doesn't make pretty sentences by default, though you can keep generating words till you find something you like the sound of.
- We'd need to watch out for words that look equivalent to a human, but aren't.
/ You actually need to write a reasonably good search program. This isn't too hard, and the database it generates isn't too big. (There's a sketch of one below.)
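A rough sketch of what that search program might look like, again in Python with illustrative names (the wordlist path is hypothetical):

    import hashlib
    from collections import defaultdict

    BW = 20

    def ht(word):
        digest = hashlib.sha256(word.encode('utf-8')).digest()
        return int.from_bytes(digest, 'big') >> (len(digest) * 8 - BW)

    # Map each BW-bit value to the words whose truncated hash equals it.
    table = defaultdict(list)
    with open('/usr/share/dict/words') as f:  # hypothetical wordlist
        for line in f:
            word = line.strip().lower()
            if word:
                table[ht(word)].append(word)

    def encode(hash80):
        # Render an 80-bit integer as four words, or None if some chunk
        # has no word hashing to it (the coverage problem noted above).
        words = []
        for shift in (60, 40, 20, 0):
            chunk = (hash80 >> shift) & ((1 << BW) - 1)
            if chunk not in table:
                return None
            words.append(table[chunk][0])  # or pick the nicest-sounding one
        return '.'.join(words)

With a ~100k-word list spread over 2^20 buckets, most chunks come up empty, which is exactly why you'd have to keep generating keys until the fingerprint is representable.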
There are other issues and other possible solutions too.
yrs,
Nick
A few issues:
1. There's no such thing as "bits per word" without a dictionary, or better yet, a frequency-sorted dictionary. :-P (See Alex's comments in https://plus.google.com/u/0/103112149634414554669/posts/FMDg2Vht8sb for a good idea on this.)
For reference, the OED is ~250k words (~18 bits at most, without considering frequency).
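For reference, the arithmetic behind these dictionary-size figures (a quick sketch, not from the thread):

    import math

    # Bits of entropy from one uniform draw out of a dictionary of size N:
    print(math.log2(250000))  # ~17.93 bits: the whole OED
    print(math.log2(4096))    # 12.0 bits: the 4k-word figure mentioned below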
2. People cannot remember a random ordered sequence of even four rare words. Consider that memory competitions involve doing *exactly that*, and that untrained people generally can't retain even a small number of such words. What people do in competitions, and what I've proposed in the draft, is remember *scenes*: e.g. something you can state in a sentence.
Also, you can't use words that people don't know. "Squibbish" might as well be "kaftorn". Merely being a word is not sufficient; it has to be something with semantic value. "English-sounding" is useless except for the very low bar of "can be spoken over a telephone".
This of course constrains the dictionary considerably, such that assembling even 12 bits (a 4k-word dictionary) is not completely trivial. But remember, the point of my proposal is *not* "can you output some random crap that's word-ish" à la the horrible RFC2289. It's "can you give people something they'll actually remember".
3. People are not going to remember whether it was, say, 48 or 49 or 84. Numbers are effectively useless for recall. People don't even think in decimal, they think in very rough logarithmic approximations. :-P
4. How would you 'keep generating words'? My proposal was that every hash have a single, canonical phrasal mapping. You would not be able to change it unless you change your hidden service descriptor. (You could do that if you're willing to "move" your site, or are making a new one.)
If you don't have canonical mapping, siteops can't tell people what their phrasal descriptor is, which means you lose a majority of its utility for security.
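To make "canonical" concrete, here is one hypothetical shape for such a mapping (only a sketch of the determinism, not the draft's actual scene-based scheme): a fixed, ordered, shared wordlist indexed directly by the hash bits, so each hash has exactly one phrasal form and vice versa.

    BW = 20

    def canonical_encode(hash80, wordlist):
        # Requires a fixed, shared, ordered wordlist of >= 2**BW entries.
        assert len(wordlist) >= 1 << BW
        chunks = [(hash80 >> s) & ((1 << BW) - 1) for s in (60, 40, 20, 0)]
        return '.'.join(wordlist[c] for c in chunks)

    def canonical_decode(phrase, wordlist):
        index = {w: i for i, w in enumerate(wordlist)}
        value = 0
        for word in phrase.split('.'):
            value = (value << BW) | index[word]
        return value

No memorable dictionary has 2^20 entries, of course, so a real scheme needs smaller per-slot vocabularies (the draft's scenes point in that direction); the sketch only shows the determinism that lets a siteop publish one fixed phrase.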
For that matter, how would you ship it without a dictionary? Are you expecting that most users will already have a common dictionary file?
5. Not having it be in English means you have to support internationalized domain names (nearly all other languages have non-ASCII characters). AIUI that's just not technologically up to speed yet.
I should emphasize that this is *not* a technical problem, it's a problem of cognitive linguistics. Writing the code is relatively trivial; it's just a simple parser / generator layer.
- Sai