Hello all.
We've written up our proposal for mnemonic .onion URLs.
See https://docs.google.com/document/d/sT5CulCVl0X5JeOv4W_wC_A/edit?disco=AAAAAE... for details; please read the full intro for explanations and caveats, as some are important.
tl;dr: It's a system that would have all three properties of being secure, distributed, and human-meaningful… but would *not* also have choice of name (though it has the required *canonicality* of names), and has a somewhat absurdist definition of 'meaningful'. :-P
Please feel free to put comments there or on list.
Right now we're at the stage just before implementation; namely, we haven't yet collated the necessary dictionaries, but we have a reasonably good idea of how the system would work, including needed constraints on the dictionaries. If you have suggestions or comments, now is a good time to talk about them, so that if any of it affects the dictionary collation step we don't waste work.
Thanks, Sai
On 2012-02-28, Sai tor@saizai.com wrote:
Hello all.
We've written up our proposal for mnemonic .onion URLs.
See https://docs.google.com/document/d/sT5CulCVl0X5JeOv4W_wC_A/edit?disco=AAAAAE... for details; please read the full intro for explanations and caveats, as some are important.
I'm not going to follow that link. (Tor specification-change proposals are sent to the tor-dev mailing list in their entirety and copied into a Git repository for archival, not left on an easily-changed web page.)
tl;dr: It's a system that would have all three properties of being secure, distributed, and human-meaningful… but would *not* also have
We do not care whether names are ‘human-meaningful’. (“Tor” is not a human-meaningful name.)
We would like a naming system which provides *memorable* names, if that is possible. (I've never seen a distributed naming system which provides secure and memorable names.)
But we care even more about other usability properties of a naming system, such as how easily users can type a name given a copy of it on paper, how easily users can transfer a name to a friend over the telephone, and how easily users can compare two names maliciously crafted by an attacker with plausible computational power to be similar (whether in written form or in spoken form).
choice of name (though it has the required *canonicality* of names),
By proposing to add a new naming system for Tor's existing hidden service protocol, you are already assuming and claiming that hidden service names do not need to be canonical. Why do you think ‘canonicality’ is required?
and has a somewhat absurdist definition of 'meaningful'. :-P
Then your system's names are unlikely to be memorable.
Please feel free to put comments there or on list.
Right now we're at the stage just before implementation; namely, we haven't yet collated the necessary dictionaries, but we have a reasonably good idea of how the system would work, including needed constraints on the dictionaries. If you have suggestions or comments, now is a good time to talk about them, so that if any of it affects the dictionary collation step we don't waste work.
The dictionaries required by a dictionary-based naming system strongly influence whether the resulting names will be memorable. The usability tests which will prove that your scheme does not provide sufficient usability benefit to justify shipping many large dictionaries with Tor cannot begin until after you have collected the dictionaries.
Robert Ransom
On Tue, Feb 28, 2012 at 17:53, Robert Ransom rransom.8774@gmail.com wrote:
I'm not going to follow that link.
… yet you're going to comment anyway, based merely on your imagination of what it contains? O.o
(Tor specification-change proposals are sent to the tor-dev mailing list in their entirety and copied into a Git repository for archival, not left on an easily-changed web page.)
Yeah: that's the point. This is a proposal, not a full implementation, let alone a final one. It's going to be edited.
We would like a naming system which provides *memorable* names, if that is possible. (I've never seen a distributed naming system which provides secure and memorable names.)
"I've never seen" isn't really a statement about my proposal.
But we care even more about other usability properties of a naming system, such as how easily users can type a name given a copy of it on paper, how easily users can transfer a name to a friend over the telephone, and how easily users can compare two names maliciously crafted by an attacker with plausible computational power to be similar (whether in written form or in spoken form).
All agreed there.
choice of name (though it has the required *canonicality* of names),
By proposing to add a new naming system for Tor's existing hidden service protocol, you are already assuming and claiming that hidden service names do not need to be canonical. Why do you think ‘canonicality’ is required?
… you just contradicted yourself within two sentences.
Canonicality is mandatory for domain names of all kinds; otherwise there's no way to advertise them, transfer references to them between users, etc. If your name for some service only works for you, it's not very useful.
and has a somewhat absurdist definition of 'meaningful'. :-P
Then your system's names are unlikely to be memorable.
Not true. Consider that e.g. mnemonics used in med school *all* consist of absurdist phrases.
It would be more memorable if it were short and operator-specified, but for that you need a petname system, which this is not.
The dictionaries required by a dictionary-based naming system strongly influence whether the resulting names will be memorable.
Yes, of course. So will using good syntax generation.
The usability tests which will prove that your scheme does not provide sufficient usability benefit to justify shipping many large dictionaries with Tor cannot begin until after you have collected the dictionaries.
a) who said it requires 'many large dictionaries'? b) I said upfront that the point of asking for comments is to make sure the dictionaries collected are good a priori. Your challenging my proposal by saying that we need dictionaries before testing — which is obvious; you can't implement this scheme without dictionaries — seems pointlessly combative to me.
I suggest you try actually reading proposals before bitching about them. We addressed most of the issues you mention in the proposal.
- Sai
On Tue, 2012-02-28 at 16:27 -0500, Sai wrote:
We've written up our proposal
Hmm, I wrote one last week. I described its mathematical implementation. I got some replies saying it's a cool idea, and some others trolling.
https://lists.torproject.org/pipermail/tor-talk/2012-February/023376.html
Well, I can implement it as one C/C++ header file if any of the tor-devs are interested in it.
Sai, I can implement it with you if you want. You mentioned some good points, such as "Have a fixed number of template sentences", but you barely mentioned how to implement your proposal.
You need 80 bits of space to avoid collisions.
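The 80-bit figure can be sanity-checked with a rough birthday-bound estimate. This is a sketch for illustration only (the uniform-random collision model is an assumption, not part of either proposal):

```python
import math

def birthday_attempts(bits, probability=0.5):
    """Approximate number of random draws from a 2**bits space needed
    for roughly the given probability that *some* two draws collide."""
    n = 2.0 ** bits
    return math.sqrt(2 * n * math.log(1 / (1 - probability)))

# Some collision among random 80-bit addresses takes ~2^40 draws;
# spoofing one *specific* address is a second-preimage problem and
# needs ~2^79 expected attempts, which is what the 80 bits buy.
print(birthday_attempts(80))
```

So the 80-bit space is what keeps targeted spoofing out of reach, while partial-match fuzzing (matching only a short prefix) remains cheap.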
On Tue, Feb 28, 2012 at 18:13, Ahmed Hassan ahmed@linuxism.com wrote:
You mentioned some good points, such as "Have a fixed number of template sentences", but you barely mentioned how to implement your proposal.
We didn't elaborate at length because we thought the implementation details of that weren't important for understanding the idea at this stage. We will of course provide a demo implementation (probably in Ruby or Perl) once we have the dictionaries compiled. I don't know the Privoxy source at all, so would need help in translating it to work there, but that can be done afterwards.
Compiling the dictionaries is a significant amount of work (cf our requirements list, which eg RFC1751 & RFC2289 utterly fail). That's why we're stopping for input now. It's also important to have the dictionaries compiled before we can generate a full set of templates; there are some complicated linguistic interconstraints that make just generating one without the other a bad idea.
Testing can be done in various ways, e.g. just asking subjects to remember (both for recognition and input) a random hash-phrase for 1h, 1d, or 1w, before full deployment.
- Sai
Hi Sai,
It looks like you've put a lot of thought into what would make a good hash-to-word system. However, you have a false assumption, that dictionary systems can simultaneously have all three properties of Zooko's Triangle. This is a popular idea, but unfortunately untrue.
Hashes are effectively random and so have maximum information density. Words do not have maximum information density, they have redundancy, which is why they are easier to remember and tell apart from each other than random strings. However, this comes at the cost of making the words longer. The more redundant information that you add in terms of constraints such as part of speech, the longer you will need to make the words (on average) so that they can contain this additional information. If you look at the 4 little words post you will notice that the phrases are about 5 characters longer than the IPv4 addresses. Of course you could make the claim that sometimes longer strings are easier to remember than shorter ones. There is an intuitive appeal to the idea that words are more memorable than hexadecimal strings (or base64 or whatever). That might be true sometimes for special cases, but there is no evidence that it is true generally or in this particular case.
On Tue, Feb 28, 2012 at 3:27 PM, Sai tor@saizai.com wrote:
Hello all.
We've written up our proposal for mnemonic .onion URLs.
See https://docs.google.com/document/d/sT5CulCVl0X5JeOv4W_wC_A/edit?disco=AAAAAE... for details; please read the full intro for explanations and caveats, as some are important.
tl;dr: It's a system that would have all three properties of being secure, distributed, and human-meaningful… but would *not* also have choice of name (though it has the required *canonicality* of names), and has a somewhat absurdist definition of 'meaningful'. :-P
Please feel free to put comments there or on list.
Right now we're at the stage just before implementation; namely, we haven't yet collated the necessary dictionaries, but we have a reasonably good idea of how the system would work, including needed constraints on the dictionaries. If you have suggestions or comments, now is a good time to talk about them, so that if any of it affects the dictionary collation step we don't waste work.
Thanks, Sai
On Tue, Feb 28, 2012 at 20:04, Brandon Wiley brandon@blanu.net wrote:
However, you have a false assumption, that dictionary systems can simultaneously have all three properties of Zooko's Triangle.
This isn't an "assumption", but rather an actual claim about the properties of the specific system I proposed (within the caveats I gave).
If you want to refute it, go ahead, but do so by challenging a given property I'm claiming as applied to my proposal, rather than merely dismissing it out of hand on the basis of an axiomatic belief that it's impossible to have the conjunction.
Hashes are effectively random and so have maximum information density. Words do not have maximum information density, they have redundancy, which is why they are easier to remember and tell apart from each other than random strings. However, this comes at the cost of making the words longer. The more redundant information that you add in terms of constraints such as part of speech, the longer you will need to make the words (on average) so that they can contain this additional information. If you look at the 4 little words post you will notice that the phrases are about 5 characters longer than the IPv4 addresses.
The entire point of the encoding I propose is to do exactly this — convert something that's dense but impossible to remember into something that's less dense in expression but easier in cognitive/linguistic encoding.
However, you're making a false assumption: that string length is an appropriate measure for effective storage requirements.
It's not, when it comes to human cognition.
"Maximum information density" is a useful concept when talking about computer storage, but it is grossly misleading when talking about human memory, and the latter is what this proposal is intending to optimize for.
Of course you could make the claim that sometimes longer strings are easier to remember than shorter ones.
Indeed I do, and plenty of cogsci research supports this claim. How easy something is to remember has much more to do with exactly what the constraints are on memory (eg whether synonyms are treated as distinct), how networked it is to previously remembered information, state cues, etc.
There is an intuitive appeal to the idea that words are more memorable than hexadecimal strings (or base64 or whatever). That might be true sometimes for special cases, but there is no evidence that it is true generally or in this particular case.
Have you studied cognitive science or cognitive linguistics? This is extremely well established research.
As one example you may find more tractable, take a look at the "memory palace" technique — extremely effective, BTW, and used by people who do memory competitions. It uses exactly this kind of "expansion" to transform something that's very hard to remember (eg the complete order of a deck of cards, or a long series of phone numbers) into something that's easier (eg the various participants in a silly scenario).
My proposal does essentially the same thing, except in a way that also fulfills the secure (i.e. canonically interconvertible with a given hash) and distributed (i.e. not reliant on any centralized or even coordinated authority) properties. It simply assigns a memorable scenario / phrase to each hash.
- Sai
Reformatted again for your committing pleasure:
Filename: xxx-mnemonic_urls.txt
Title: Mnemonic .onion URLs
Author: Sai, Alex Fink
Created: 29-Feb-2012
Status: Open
1. Overview
Currently, canonical Tor .onion URLs consist of a naked 80-bit hash[1]. This is not something that users can even recognize for validity, let alone produce directly. It is vulnerable to partial-match fuzzing attacks[2], where a would-be MITM attacker generates a very similar hash and uses various social engineering, wiki poisoning, or other methods to trick the user into visiting the spoof site.
This proposal gives an alternative method for displaying and entering .onion and other URLs, such that they will be easily remembered and generated by end users, and easily published by hidden service websites, without any dependency on a full domain name type system like e.g. namecoin[3]. This makes it easier to implement (requiring only a change in the proxy).
This proposal could equally be used for IPv4, IPv6, etc, if normal DNS is for some reason untrusted.
This is not a petname system[4], in that it does not allow service providers or users[5] to associate a name of their choosing to an address[6]. Rather, it is a mnemonic system that encodes the 80 bit .onion address into a meaningful[7] and memorable sentence. A full petname system (based on registration of some kind, and allowing for shorter, service-chosen URLs) can be implemented in parallel[8].
This system has the three properties of being secure, distributed, and human-meaningful — it just doesn't also have choice of name (except of course by brute force creation of multiple keys to see if one has an encoding the operator likes).
This is inspired by Jonathan Ackerman's "Four Little Words" proposal[9] for doing the same thing with IPv4 addresses. We just need to handle 80+ bits, not just 32 bits.
It is similar to Markus Jakobsson & Ruj Akavipat's FastWord system[10], except that it does not permit user choice of passphrase, does not know what URL a user will enter (vs verifying against a single stored password), and again has to encode significantly more data.
This is also similar to RFC1751[11], RFC2289[12], and multiple other fingerprint encoding systems[13] (e.g. PGPfone[14] using the PGP wordlist[15], and Arturo Filatsò's OnionURL[16]), but we aim to make something that's as easy as possible for users to remember — and significantly easier than just a list of words or pseudowords, which we consider only useful as an active confirmation tool, not as something that can be fully memorized and recalled, like a normal domain name.
2. Requirements
2.1. encodes at least 80 bits of random data (preferably more, eg for a checksum)
2.2. valid, visualizable English sentence — not just a series of words[17]
2.3. words are common enough that non-native speakers and bad spellers will have minimum difficulty remembering and producing (perhaps with some spellcheck help)
2.4. not syntactically confusable (e.g. order should not matter)
2.5. short enough to be easily memorized and fully recalled at will, not just recognized
2.6. no dependency on an external service
2.7. dictionary size small enough to be reasonable for end users to download as part of the onion package
2.8. consistent across users (so that websites can e.g. reinforce their random hash's phrase with a clever drawing)
2.9. not create offensive sentences that service providers will reject
2.10. resistant against semantic fuzzing (e.g. by having uniqueness against WordNet synsets[18])
3. Possible implementations
This section is intentionally left unfinished; full listing of template sentences and the details of their parser and generating implementation is co-dependent on the creation of word class dictionaries fulfilling the above criteria. Since that's fairly labor-intensive, we're pausing at this stage for input first, to avoid wasting work.
3.1. Have a fixed number of template sentences, such as:
1. Adj subj adv vtrans adj obj
2. Subj and subj vtrans adj obj
3. … etc
For a 6 word sentence, with 8 (3b) templates, we need ~12b (4k word) dictionaries for each word category.
If multiple words of the same category are used, they must either play different grammatical roles (eg subj vs obj, or adj on a different item), be chosen from different dictionaries, or there needs to be an order-agnostic way to join them at the bit level. Preferably this should be avoided, just to prevent users forgetting the order.
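A sketch of 3.1's arithmetic in Python. The dictionaries here are toy placeholders (the real 4096-word dictionaries are yet to be collated), and note that this budget is 3 + 6×12 = 75 bits, slightly short of 80:

```python
TEMPLATE_BITS = 3   # 8 template sentences
WORD_BITS = 12      # 4096-word dictionary per slot
SLOTS = 6           # words per sentence
TOTAL_BITS = TEMPLATE_BITS + SLOTS * WORD_BITS  # 75: short of 80

# Toy dictionaries: "w<slot>_<index>" stands in for real curated words.
DICTS = [[f"w{s}_{i}" for i in range(2 ** WORD_BITS)]
         for s in range(SLOTS)]

def encode(h):
    """Map an integer in [0, 2**75) to (template_id, word list)."""
    assert 0 <= h < 2 ** TOTAL_BITS
    template_id = h >> (SLOTS * WORD_BITS)
    words = [DICTS[s][(h >> (s * WORD_BITS)) & (2 ** WORD_BITS - 1)]
             for s in range(SLOTS)]
    return template_id, words

def decode(template_id, words):
    """Invert encode(); the mapping is canonical, so it round-trips."""
    h = template_id
    for s in reversed(range(SLOTS)):
        h = (h << WORD_BITS) | DICTS[s].index(words[s])
    return h
```

Because the mapping is a plain bijection between integers and (template, word) tuples, canonicality comes for free: every hash has exactly one sentence and vice versa.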
3.2. As 3.1, but treat sentence generation as decoding a prefix code, and have a Huffman code for each word class.
We suppose it’s okay if the generated sentence has a few more words than it might, as long as they’re common lean words. E.g., for adjectives, "good" might cost only six bits while "unfortunate" costs twelve.
Choice between different sentence syntaxes could be worked into the prefix code as well, and potentially done separately for each syntactic constituent.
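To illustrate 3.2, here is a hand-built toy prefix code for one word class. A real system would derive codeword lengths from word-frequency data via Huffman's algorithm; these particular words and codes are made up:

```python
# Common words get short codewords, rarer ones longer, so decoding a
# random bit string tends to emit common words. Toy, hand-built code.
ADJ_CODE = {
    "good": "00",
    "red": "01",
    "heavy": "100",
    "sleepy": "101",
    "unfortunate": "110",
    "perpendicular": "111",
}
ADJ_DECODE = {code: word for word, code in ADJ_CODE.items()}

def consume_word(bits, decode_table=ADJ_DECODE):
    """Read one codeword off the front of a bit string and return
    (word, remaining_bits). Prefix-freeness makes this unambiguous."""
    for end in range(1, len(bits) + 1):
        if bits[:end] in decode_table:
            return decode_table[bits[:end]], bits[end:]
    raise ValueError("no codeword at front of bit string")

# consume_word("1100100") -> ("unfortunate", "0100")
```

Decoding the whole hash is then just repeated consume_word calls, with the table switched per word class (and potentially per syntactic constituent, as above).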
4. Usage
To form a mnemonic .onion URL, just join the words with dashes or underscores, stripping minimal words like 'a', 'the', 'and' etc., and append '.onion'. This can be readily distinguished from standard hash-style .onion URLs by form.
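The joining rule might look like this (the exact stopword set and the choice of dash over underscore are illustrative assumptions, not fixed by the proposal):

```python
# Sketch of the URL-formation rule: strip minimal words, join the
# rest with dashes, append ".onion". Stopword list is assumed.
STOPWORDS = {"a", "an", "the", "and"}

def mnemonic_url(sentence):
    words = [w.lower() for w in sentence.split()
             if w.lower() not in STOPWORDS]
    return "-".join(words) + ".onion"

# Using the example sentence from footnote [7]:
print(mnemonic_url("Barnaby thoughtfully mangles simplistic yellow camels"))
# barnaby-thoughtfully-mangles-simplistic-yellow-camels.onion
```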
Translation should take place at the client — though hidden service servers should also be able to output the mnemonic form of hashes, to assist website operators in publishing them (e.g. by posting an amusing drawing of the described situation on their website to reinforce the mnemonic).
After the translation stage of name resolution, everything proceeds as normal for an 80-bit hash onion URL.
The user should be notified of the mnemonic form of a hash URL in some way, and have an easy way in the client UI to translate mnemonics to hashes and vice versa. For the purposes of browser URLs and the like, though, the mnemonic should be treated on par with the hash; if the user enters a mnemonic URL, they should not be redirected to the hash version. (If anything, the opposite may be true, so that users become used to seeing and verifying the mnemonic version of hash URLs, and gain the security benefits against partial-match fuzzing.)
Ideally, inputs that don't validly resolve should get a response page served by the proxy that uses a simple spell-check system to suggest alternate domain names that are valid hash encodings. This could hypothetically be done inline in URL input, but that would require changes in the browser (normally domain names aren't subject to spellcheck), and the response-page approach avoids that implementation problem.
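A minimal version of that suggestion logic can be sketched with Python's standard difflib; the tiny dictionary and the similarity cutoff here are placeholders:

```python
import difflib

# Stand-in word list; the real candidates would come from the
# collated per-class dictionaries of section 3.
DICTIONARY = ["barnaby", "thoughtfully", "mangles", "simplistic",
              "yellow", "camels"]

def suggest(word, n=3, cutoff=0.7):
    """Return up to n dictionary words similar to a mistyped label."""
    return difflib.get_close_matches(word.lower(), DICTIONARY,
                                     n=n, cutoff=cutoff)

# suggest("yelow") -> ["yellow"]
```

A real implementation would run this per word slot and then check that each suggested sentence actually decodes to a registered hash.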
5. International support
It is not possible for this scheme to support non-English languages without (a) (usually) Unicode in domains, which is not yet well supported by browsers, and (b) fully customized dictionaries and phrase patterns per language.
The scheme must not be used in an attempted 'translation' by simply replacing English words with glosses in the target language. Several of the necessary features would be completely mangled by this (e.g. other languages have different synonym, homonym, etc groupings, not to mention completely different grammar).
It is unlikely a priori that URLs constructed using a non-English dictionary/pattern setup would in any sense 'translate' semantically to English; more likely is that each language would have completely unrelated encodings for a given hash.
We intend to only make an English version at first, to avoid these issues during testing.
________________
[1] https://trac.torproject.org/projects/tor/wiki/doc/HiddenServiceNames
    https://gitweb.torproject.org/torspec.git/blob/HEAD:/address-spec.txt
[2] http://www.thc.org/papers/ffp.html
[3] http://dot-bit.org/Namecoin
[4] https://en.wikipedia.org/wiki/Zooko%27s_triangle
[5] https://addons.mozilla.org/en-US/firefox/addon/petname-tool/
[6] However, service operators can generate a large number of hidden service descriptors and check whether their hashes result in a desirable phrasal encoding (much like certain hidden services currently use brute force generated hashes to ensure their name is the prefix of their raw hash). This won't get you whatever phrase you want, but will at least improve the likelihood that it's something amusing and acceptable.
[7] "Meaningful" here inasmuch as e.g. "Barnaby thoughtfully mangles simplistic yellow camels" is an absurdist but meaningful sentence. Absurdness is a feature, not a bug; it decreases the probability of mistakes if the scenario described is not one that the user would try to fit into a template of things they have previously encountered IRL. See research into linguistic schema for further details.
[8] https://gitweb.torproject.org/torspec.git/blob/HEAD:/proposals/ideas/xxx-onion-nyms.txt
[9] http://blog.rabidgremlin.com/2010/11/28/4-little-words/
[10] http://fastword.me/
[11] https://tools.ietf.org/html/rfc1751
[12] http://tools.ietf.org/html/rfc2289
[13] https://github.com/singpolyma/mnemonicode
     http://mysteryrobot.com
     https://github.com/zacharyvoase/humanhash
[14] http://www.mathcs.duq.edu/~juola/papers.d/icslp96.pdf
[15] http://en.wikipedia.org/wiki/PGP_word_list
[16] https://github.com/hellais/Onion-url
     https://github.com/hellais/Onion-url/blob/master/dev/mnemonic.py
[17] http://www.reddit.com/r/technology/comments/ecllk
[18] http://wordnet.princeton.edu/wordnet/man2.1/wnstats.7WN.html
On 02/29/2012 02:58 PM, Sai wrote:
Reformatted again for your committing pleasure:
I've added this as proposal 194: https://gitweb.torproject.org/torspec.git/blob/HEAD:/proposals/194-mnemonic-...
Thanks!
All the best, Jake
On Feb 29, 2012 1:58 PM, "Sai" tor@saizai.com wrote:
For a 6-word sentence, with 8 (3-bit) templates, we need ~12-bit (4096-word) dictionaries for each word category.
1. You need 2^8=256 templates, not just 8, to reach 6*12+8=80 bits.
2. Having toyed with this idea in the past, let me warn that forming a 4096 word dictionary of memorable, non-colliding words for each word category is going to be very difficult. Too many words are semantically similar, phonetically similar, or just unfamiliar. You might find Google Ngrams a good resource for common words; I provide a complete sorted list here:
http://kenta.blogspot.com/2012/02/lefoezyy-some-notes-on-google-books.html
Another way to go about it might be to first catalogue semantic categories (colors, animals, etc.) then list the most common (yet dissimilar) members of each category. An attempt at 64 words is here:
http://kenta.blogspot.com/2011/10/xpmqawkv-common-words.html
I'd propose that the "right" way to do this is not just sentences, but entire semantically consistent stories, written in rhyming verse, with entropy of perhaps only a few bits per sentence. (Prehistoric oral tradition does prove we can memorize such poems.) However, synthesizing these seems extremely difficult; it's essentially an AI problem.
3. I presume people are familiar with Bubblebabble? It doesn't solve all the problems, but does make bit strings seem less "dense".
Ken
On Tue, Mar 20, 2012 at 20:11, Ken Takusagawa II ken.takusagawa.2@gmail.com wrote:
- You need 2^8=256 templates, not just 8, to reach 6*12+8=80 bits.
We won't know for sure how it hashes out until we make both the dictionaries and the syntax generator. The ambiguity was intentional.
But yes, it may well use a number of generated templates. We're thinking of making it symbolic expansion based, which is more efficient on bits but also more complicated to describe before it's fixed (and it'll require a parser library).
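The bit budget being discussed can be sanity-checked with a quick sketch (assuming 12-bit, i.e. 4096-word, dictionaries and the 80-bit truncated hash of current .onion addresses):

```python
import math

ONION_BITS = 80  # a .onion address encodes an 80-bit truncated hash (16 base32 chars)

def sentence_bits(num_templates: int, words_per_sentence: int, dict_size: int) -> float:
    """Total entropy of one templated sentence: template choice plus word choices."""
    return math.log2(num_templates) + words_per_sentence * math.log2(dict_size)

print(sentence_bits(8, 6, 4096))    # 8 templates: 3 + 72 = 75 bits, 5 bits short
print(sentence_bits(256, 6, 4096))  # 256 templates: 8 + 72 = 80 bits
```

This is just Ken's arithmetic restated; a symbolic-expansion grammar would distribute those template bits differently, but the total still has to reach 80.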
- Having toyed with this idea in the past, let me warn that forming a 4096
word dictionary of memorable, non-colliding words for each word category is going to be very difficult. Too many words are semantically similar, phonetically similar, or just unfamiliar.
Our intention currently is to first take candidate dictionaries from WordNet, and use a combination of WordNet and Google 1-gram frequency data as part of the cutoff for whether words are adequately familiar. (N-grams with n >= 2 are rather irrelevant to our needs, AFAICT.)
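As a sketch of that familiarity cutoff (the counts and threshold below are invented placeholders, not real Google 1-gram data):

```python
# Hypothetical 1-gram counts standing in for the Google Books frequency data;
# real candidate lists would come from WordNet and be filtered the same way.
UNIGRAM_COUNTS = {
    "water": 9_500_000,
    "house": 7_200_000,
    "goat": 310_000,
    "mangle": 45_000,
    "sesquipedalian": 900,
}

FAMILIARITY_CUTOFF = 100_000  # assumed threshold; would be tuned per dictionary

def familiar_words(candidates, counts, cutoff=FAMILIARITY_CUTOFF):
    """Keep only candidates common enough to be recognized on sight."""
    return [w for w in candidates if counts.get(w, 0) >= cutoff]

print(familiar_words(UNIGRAM_COUNTS, UNIGRAM_COUNTS))
# water, house, goat survive; mangle and sesquipedalian fall below the cutoff
```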
http://kenta.blogspot.com/2012/02/lefoezyy-some-notes-on-google-books.html
Thanks; that could be useful.
Another way to go about it might be to first catalogue semantic categories (colors, animals, etc.) then list the most common (yet dissimilar) members of each category. An attempt at 64 words is here:
This is something that WordNet has already done.
http://kenta.blogspot.com/2011/10/xpmqawkv-common-words.html
I think you omit far more common words, which you shouldn't, e.g. air, water, coal, man, house, etc.
But quibbling at this level is pointless; we'll need to be dealing with dictionaries mostly on the order of a few thousand words, sorted by *constituent types*, not by semantic categories. (E.g. one dictionary would be "nouns that can be the target of a transitive verb".)
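To make the constituent-type idea concrete, here is a toy sketch (tiny invented dictionaries with 2 bits per slot; none of this is from the actual proposal, where each slot would hold ~4096 words, i.e. 12 bits, and the template itself would be chosen by further hash bits):

```python
# Toy constituent-type dictionaries; every word here is a placeholder.
# The "object" slot illustrates "nouns that can be the target of a transitive verb".
DICTS = {
    "subject":    ["Barnaby", "Matilda", "Horace", "Petra"],
    "adverb":     ["thoughtfully", "gleefully", "briskly", "quietly"],
    "trans_verb": ["mangles", "paints", "borrows", "admires"],
    "adjective":  ["simplistic", "yellow", "enormous", "gentle"],
    "object":     ["camels", "teapots", "ladders", "violins"],
}
TEMPLATE = ["subject", "adverb", "trans_verb", "adjective", "object"]

def encode(bits: int) -> str:
    """Peel off 2 bits per slot (each toy dictionary holds 4 words)."""
    words = []
    for slot in TEMPLATE:
        words.append(DICTS[slot][bits & 0b11])
        bits >>= 2
    return " ".join(words)

def decode(sentence: str) -> int:
    """Invert encode() by looking each word up in its slot's dictionary."""
    bits = 0
    for slot, word in zip(reversed(TEMPLATE), reversed(sentence.split())):
        bits = (bits << 2) | DICTS[slot].index(word)
    return bits

print(encode(0))  # Barnaby thoughtfully mangles simplistic camels
```

Because each word belongs to exactly one slot's dictionary, decoding is unambiguous; that property is precisely what the real (much larger) dictionaries have to preserve.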
I'd propose that the "right" way to do this is not just sentences, but entire semantically consistent stories, written in rhyming verse, with entropy of perhaps only a few bits per sentence. (Prehistoric oral tradition does prove we can memorize such poems.) However, synthesizing these seems extremely difficult, an AI problem.
I think it's currently impossible to do that, and furthermore, that it's *not* Right even if you could — because it would violate a key constraint: that it can be reasonably typed as a domain. It shouldn't take longer than a few seconds to remember and type. It won't be as fast as typing "google.com", and that's OK, but I think that level of redundant expansion is way too much.
Creating unambiguously parseable syntaxes and dictionaries that meet our stated constraints is already hard enough. ;-)
- I presume people are familiar with Bubblebabble? It doesn't solve all
the problems, but does make bit strings seem less "dense".
BubbleBabble produces nonwords; as such it fails a basic requirement. Making something merely look phonotactically valid isn't enough; it has to be grammatically valid and composed entirely of known terms.
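For reference, a minimal Bubble Babble encoder (following the letter tables and checksum of the draft specification) shows the kind of pronounceable-but-meaningless output at issue:

```python
def bubblebabble(data: bytes) -> str:
    """Encode bytes per the Bubble Babble draft spec (vowel/consonant tuples)."""
    vowels = "aeiouy"
    consonants = "bcdfghklmnprstvzx"
    c = 1  # running checksum
    parts = []
    # Each full 2-byte group yields five letters.
    for i in range(0, len(data) - 1, 2):
        b1, b2 = data[i], data[i + 1]
        parts.append(vowels[(((b1 >> 6) & 3) + c) % 6])
        parts.append(consonants[(b1 >> 2) & 15])
        parts.append(vowels[((b1 & 3) + c // 6) % 6])
        parts.append(consonants[(b2 >> 4) & 15])
        parts.append(consonants[b2 & 15])
        c = (c * 5 + b1 * 7 + b2) % 36
    # Final partial group: one leftover byte, or none (checksum only).
    if len(data) % 2:
        b = data[-1]
        parts += [vowels[(((b >> 6) & 3) + c) % 6],
                  consonants[(b >> 2) & 15],
                  vowels[((b & 3) + c // 6) % 6]]
    else:
        parts += [vowels[c % 6], "x", vowels[c // 6]]
    s = "x" + "".join(parts) + "x"
    return "-".join(s[i:i + 5] for i in range(0, len(s), 5))

print(bubblebabble(b"1234567890"))  # xesef-disof-gytuf-katof-movif-baxux
```

Output like "xesef-disof-gytuf" is phonotactically plausible English, but contains no dictionary words, which is exactly the objection above.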
- Sai
One more note: the Soundex and Double Metaphone algorithms may be useful for determining if two words sound alike.
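For illustration, a compact American Soundex implementation (Double Metaphone is considerably more involved and would normally be pulled in from an existing library):

```python
def soundex(word: str) -> str:
    """American Soundex: keep the first letter, code remaining consonants as digits."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    word = word.lower()
    first = word[0].upper()
    result = []
    prev = codes.get(word[0], "")
    for ch in word[1:]:
        if ch in "hw":            # h and w are transparent: previous code persists
            continue
        code = codes.get(ch, "")  # vowels (and y) get "", resetting prev
        if code and code != prev:
            result.append(code)
        prev = code
    return (first + "".join(result) + "000")[:4]

print(soundex("Robert"), soundex("Rupert"))  # R163 R163
```

Note that Soundex preserves the initial letter, so e.g. "goat" (G300) and "coat" (C300) come out distinct despite differing only in voicing; a real similarity filter would likely combine several such metrics.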
And yet one more attempt at something similar from years ago, doing only words, not grammatical sentences:
http://kenta.blogspot.com/2008/08/hash-of-words.html
Ken
On Tue, Mar 20, 2012 at 23:09, Ken Takusagawa II ken.takusagawa.2@gmail.com wrote:
One more note: the Soundex and Double Metaphone algorithms may be useful for determining if two words sound alike.
True. But we don't care nearly as much about homophony as about synonymy.
Homophony is a heightened concern for things primarily intended to be conveyed over the phone (e.g. PGP fingerprints). Our primary concern is rather that names can be memorized, and memory, at least for phrases rather than isolated words, is semantic.
E.g. we would not drop 'goat' just because we have 'coat', even though the two are very similar phonetically (just one voicing difference).
While we're at it, homography is also a lesser concern — it'll mainly come up when ensuring that parsing is unambiguous. (Granted, that's a significant caveat. :-P)
- Sai