On Tue, Mar 20, 2012 at 20:11, Ken Takusagawa II <ken.takusagawa.2@gmail.com> wrote:
- You need 2^8=256 templates, not just 8, to reach 6*12+8=80 bits.
We won't know for sure how it hashes out until we make both the dictionaries and the syntax generator. The ambiguity was intentional.
But yes, it may well use a number of generated templates. We're thinking of making it symbolic expansion based, which is more efficient on bits but also more complicated to describe before it's fixed (and it'll require a parser library).
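For reference, the bit count under discussion can be checked with a rough sketch like the one below. It assumes six word slots drawn from 4096-word dictionaries plus one template index; the function names and the bit layout (template in the top bits, words below it) are made up for illustration and are not a settled design.

```python
import math

WORD_SLOTS = 6        # assumed from the "6*12" term
DICT_SIZE = 4096      # 2**12 words per dictionary
TEMPLATE_COUNT = 256  # 2**8 templates


def total_bits(slots=WORD_SLOTS, dict_size=DICT_SIZE, templates=TEMPLATE_COUNT):
    """Bits carried by one sentence: word choices plus template choice."""
    return slots * math.log2(dict_size) + math.log2(templates)


def split(value):
    """Split an 80-bit integer into (template_index, word_indices).

    Hypothetical layout: the top 8 bits pick the template, then six
    12-bit word indices follow, most significant first.
    """
    if not 0 <= value < 2 ** 80:
        raise ValueError("value must fit in 80 bits")
    indices = []
    for _ in range(WORD_SLOTS):
        indices.append(value & (DICT_SIZE - 1))  # low 12 bits
        value >>= 12
    return value, indices[::-1]


def join(template, indices):
    """Inverse of split()."""
    value = template
    for index in indices:
        value = (value << 12) | index
    return value
```

With these numbers, `total_bits()` gives 80, while `total_bits(templates=8)` gives only 75, which is the gap Ken points out.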
- Having toyed with this idea in the past, let me warn that forming a 4096-word dictionary of memorable, non-colliding words for each word category is going to be very difficult. Too many words are semantically similar, phonetically similar, or just unfamiliar.
Our intention currently is to first take candidate dictionaries from WordNet, and use a combination of WordNet and Google 1-gram frequency data as part of the cutoff for whether words are adequately familiar. (N-grams with n >= 2 are rather irrelevant to our needs, AFAICT.)
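A minimal sketch of the frequency cutoff, assuming the candidate list and the 1-gram counts have already been loaded into plain Python structures. The function name, the threshold, and the counts shown are placeholders invented for illustration; they are not real WordNet or Google data.

```python
def select_familiar(candidates, counts, min_count, size):
    """Keep candidates whose corpus count meets min_count.

    Returns at most `size` words, most frequent first; ties are
    broken alphabetically so the result is deterministic.
    """
    kept = [w for w in candidates if counts.get(w, 0) >= min_count]
    kept.sort(key=lambda w: (-counts[w], w))
    return kept[:size]


# Placeholder counts, invented for illustration only (not real 1-gram data).
counts = {"water": 900, "house": 800, "coal": 300, "zymurgy": 2}
candidates = ["zymurgy", "coal", "water", "house", "quokka"]

chosen = select_familiar(candidates, counts, min_count=100, size=3)
print(chosen)
```

Words with no count at all are treated as unfamiliar and dropped. A real cutoff would presumably also need the similarity checks mentioned above, which this sketch does not attempt.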
http://kenta.blogspot.com/2012/02/lefoezyy-some-notes-on-google-books.html
Thanks; that could be useful.
Another way to go about it might be to first catalogue semantic categories (colors, animals, etc.) then list the most common (yet dissimilar) members of each category. An attempt at 64 words is here:
This is something that WordNet has already done.
http://kenta.blogspot.com/2011/10/xpmqawkv-common-words.html
I think you omit far more common words, which you shouldn't: e.g. air, water, coal, man, house, etc.
But quibbling at this level is pointless; we'll need to be dealing with dictionaries mostly on the order of a few thousand words, sorted by *constituent types*, not by semantic categories. (E.g. one dictionary would be "nouns that can be the target of a transitive verb".)
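To make "sorted by constituent types" concrete, here is a toy sketch. It assumes a single fixed template whose slots each draw from their own dictionary; the slot names, the template, and the four-word dictionaries are all hypothetical, and real dictionaries would hold thousands of words each.

```python
# Hypothetical constituent-type dictionaries (tiny, for illustration only).
DICTIONARIES = {
    "subject": ["doctor", "pilot", "farmer", "singer"],
    "transitive_verb": ["paints", "carries", "washes", "finds"],
    "object": ["ladder", "window", "basket", "guitar"],
}

# One assumed template: each slot names the dictionary it draws from.
TEMPLATE = ["subject", "transitive_verb", "object"]


def encode(indices):
    """Turn one index per slot into a sentence."""
    return " ".join(DICTIONARIES[slot][i] for slot, i in zip(TEMPLATE, indices))


def decode(sentence):
    """Recover the indices; raises ValueError for unknown or misplaced words."""
    words = sentence.split()
    if len(words) != len(TEMPLATE):
        raise ValueError("wrong number of words for this template")
    return [DICTIONARIES[slot].index(w) for slot, w in zip(TEMPLATE, words)]


sentence = encode([2, 0, 3])
print(sentence)
print(decode(sentence))
```

Because each position is looked up only in its own dictionary, a word is rejected if it appears in the wrong slot, which is one simple way to keep parsing unambiguous.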
I'd propose that the "right" way to do this is not just sentences, but entire semantically consistent stories, written in rhyming verse, with entropy of perhaps only a few bits per sentence. (Prehistoric oral tradition does prove we can memorize such poems.) However, synthesizing these seems extremely difficult, an AI problem.
I think it's currently impossible to do that, and furthermore, that it's *not* Right even if you could — because it would violate a key constraint: that it can be reasonably typed as a domain. It shouldn't take longer than a few seconds to remember and type. It won't be as fast as typing "google.com", and that's OK, but I think that level of redundant expansion is way too much.
Creating unambiguously parseable syntaxes and dictionaries that meet our stated constraints is already hard enough. ;-)
- I presume people are familiar with Bubblebabble? It doesn't solve all the problems, but does make bit strings seem less "dense".
BubbleBabble produces nonwords; as such it fails a basic requirement. Making something merely look phonotactically valid isn't enough; it has to be grammatically valid and composed entirely of known terms.
- Sai