Hi! Tim helped me write this draft last week, and I shared it with the PrivCount authors. I've already gotten some good comments from Aaron, which I'll repost in a followup message, with his permission.
================================
Filename: 280-privcount-in-tor.txt
Title: Privacy-Preserving Statistics with PrivCount in Tor
Author: Nick Mathewson, Tim Wilson-Brown
Created: 02-Aug-2017
Status: Draft
0. Acknowledgments
Tariq Elahi, George Danezis, and Ian Goldberg designed and implemented the PrivEx blinding scheme. Rob Jansen and Aaron Johnson extended PrivEx's differential privacy guarantees to multiple counters in PrivCount:
https://github.com/privcount/privcount/blob/master/README.markdown#research-...
Rob Jansen and Tim Wilson-Brown wrote the majority of the experimental PrivCount code, based on the PrivEx secret-sharing variant. This implementation includes contributions from the PrivEx authors, and others:
https://github.com/privcount/privcount/blob/master/CONTRIBUTORS.markdown
1. Introduction and scope
PrivCount is a privacy-preserving way to collect aggregate statistics about the Tor network without exposing the statistics from any single Tor relay.
This document describes the behavior of the in-Tor portion of the PrivCount system. It DOES NOT describe the counter configurations, or any other parts of the system. (These will be covered in separate proposals.)
2. PrivCount overview
Here follows an oversimplified summary of PrivCount, with enough information to explain the Tor side of things. The actual operation of the non-Tor components is trickier than described below.
All values in the scheme below are 64-bit unsigned integers; addition and subtraction are modulo 2^64.
In PrivCount, a Data Collector (in this case a Tor relay) shares numeric data with N different Tally Reporters. (A Tally Reporter performs the summing and unblinding roles of the Tally Server and Share Keeper from experimental PrivCount.)
All N Tally Reporters together can reconstruct the original data, but no (N-1)-sized subset of the Tally Reporters can learn anything about the data.
(In reality, the Tally Reporters don't reconstruct the original data at all! Instead, they will reconstruct a _sum_ of the original data across all participating relays.)
To share data, for each value X to be shared, the relay generates random values B_1 through B_N, and shares each B_i secretly with a single Tally Reporter. The relay then publishes Y = X + SUM(B_i) + Z, where Z is a noise value taken at random from a Gaussian distribution. The Tally Reporters can reconstruct X+Z by securely computing SUM(B_i) across all contributing Data Collectors. (Tally Reporters MUST NOT share individual B_i values: that would expose the underlying relay totals.)
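As a non-normative illustration of the arithmetic above, here is a minimal Python sketch of the blinding and unblinding steps. The function names are hypothetical, the noise value Z is passed in as a plain integer (sampling the Gaussian correctly is out of scope here), and Python's secrets module stands in for whatever CSPRNG a real implementation would use:

    import secrets

    MASK = 2**64 - 1   # all arithmetic below is modulo 2^64

    def share_value(x, num_reporters, z):
        """Blind one counter value x for num_reporters Tally Reporters.

        Returns (y, blinding_values): y is published, and blinding_values[i]
        is sent secretly to Tally Reporter i.  z is the noise value (sampling
        it from the right Gaussian is out of scope for this sketch).
        """
        blinding_values = [secrets.randbits(64) for _ in range(num_reporters)]
        y = (x + sum(blinding_values) + z) & MASK
        return y, blinding_values

    def unblind_sum(published_y_values, per_reporter_sums):
        """Recover SUM(X + Z) over all contributing relays.

        published_y_values: the Y value from each relay.
        per_reporter_sums: for each Tally Reporter, the sum (mod 2^64) of all
        the B_i values it received from those relays.
        """
        return (sum(published_y_values) - sum(per_reporter_sums)) & MASK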
In order to prevent bogus data from corrupting the tally, the Tor relays and the Tally Reporters perform multiple "instances" of this algorithm, each instance covering a randomly sampled subset of the relays. The relay sends multiple Y values for each measurement, built with different sets of B_i. There are R such "instances"; in the document format below they are numbered from 0 to (R - 1).
So that the system will still produce results in the event of a single Tally Reporter failure, these instances are distributed across multiple subsets of Tally Reporters.
Below we describe a data format for this.
3. The document format
This document format builds on the line-based directory format used for other tor documents, described in Tor's dir-spec.txt.
Using this format, we describe two kinds of documents here: a "counters" document that publishes all the Y values, and a "blinding" document that describes the B_i values. But see "An optimized alternative" below.
The "counters" document has these elements:
"privctr-dump-format" SP VERSION SP SigningKey
[At start, exactly once]
Describes the version of the dump format, and provides an ed25519 signing key to identify the relay. The signing key is encoded in base64 with padding stripped. VERSION is "alpha" now, but should be "1" once this document is finalized.
[[[TODO: Do we need a counter version as well?
Noise is distributed across a particular set of counters, to provide differential privacy guarantees for those counters. Reducing noise requires a break in the collection. Adding counters is ok if the noise on each counter monotonically increases. (Removing counters always reduces noise.)
We also need to work out how to handle instances with mixed Tor versions, where some Data Collectors report different counters to other Data Collectors. (The blinding works if we substitute zeroes for missing counters on Tally Reporters. But we also need to add noise in this case.)
-teor ]]]
"starting-at" SP IsoTime
[Exactly once]
The start of the time period when the statistics here were collected.
"ending-at" SP IsoTime
[Exactly once]
The end of the time period when the statistics here were collected.
"num-instances" SP Number
[Exactly once]
The number of "instances" that the relay used (see above.)
"tally-reporter" SP Identifier SP Key SP InstanceNumbers
[At least twice]
The curve25519 public key of each Tally Reporter that the relay believes in. (If the list does not match the list of participating tally reporters, they won't be able to find the relay's values correctly.) The identifiers are non-space, non-nul character sequences. The Key values are encoded in base64 with padding stripped; they must be unique within each counters document. The InstanceNumbers are comma-separated lists of decimal integers from 0 to (num-instances - 1), in ascending order.
Keyword ":" SP Int SP Int SP Int ...
[Any number of times]
The Y values for a single measurement. There are num-instances such Y values for each measurement. They are 64-bit unsigned integers, expressed in decimal.
The "Keyword" denotes which measurement is being shared. Keyword MAY be any sequence of characters other than colon, nul, space, and newline, though implementators SHOULD avoid getting too creative here. Keywords MUST be unique within a single document. Tally Reporters MUST handle unrecognized keywords. Keywords MAY appear in any order.
It is safe to send the blinded totals for each instance to every Tally Reporter. To unblind the totals, a Tally Reporter needs:
   * a blinding document from each relay in the instance, and
   * the per-counter blinding sums from the other Tally Reporters in their instance.
[[[TODO: But is it safer to create a per-instance counters document? -- teor]]]
The semantics of individual measurements are not specified here.
"signature" SP Signature
[At end, exactly once]
The Ed25519 signature of all the fields in the document, from the first byte, up to but not including the "signature" keyword here. The signature is encoded in base64 with padding stripped.
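For illustration only, a counters document might look roughly like this; the keys and signature are truncated placeholders, and the keywords and values are invented for the example:

    privctr-dump-format alpha pAZ5R5cL...
    starting-at 2017-08-01 00:00:00
    ending-at 2017-08-02 00:00:00
    num-instances 3
    tally-reporter moriatally 8Kq1vB2c... 0,1
    tally-reporter exampletally R4nd0mKy... 1,2
    cells-relayed: 90122 88016 91444
    circuits-opened: 1322 1290 1315
    signature k9PzXtR1...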
The "blinding" document has these elements:
"privctr-secret-offsets" SP VERSION SP SigningKey
[At start, exactly once.]
The VERSION and SigningKey parameters are the same as for "privctr-dump-format".
"instances" SP Numbers
[Exactly once]
The instances that this Tally Reporter handles. They are given as comma-separated decimal integers, as in the "tally-reporter" entry in the counters document. They MUST match the instances listed in the counters document.
[[[TODO: this is redundant. Specify the constraint instead? --teor]]]
"num-counters" SP Number
[Exactly once]
The number of counters that the relay used in its counters document. This MUST be equal to the number of keywords in the counters document.
[[[TODO: this is redundant. Specify the constraint instead? --teor]]]
"tally-reporter-pubkey" SP Key
[Exactly once]
The curve25519 public key of the Tally Reporter who is intended to receive and decrypt this document. The key is base64-encoded with padding stripped.
"count-document-digest" SP "sha3" Digest NL "-----BEGIN ENCRYPTED DATA-----" NL Data "-----END ENCRYPTED DATA-----" NL
[Exactly once]
The SHA3-256 digest of the count document corresponding to this blinding document. The digest is base64-encoded with padding stripped. The Data encodes the blinding values (see "The Blinding Values" below), and is encrypted to the Tally Reporter's public key using the hybrid encryption algorithm described below.
"signature" SP Signature
[At end, exactly once]
The Ed25519 signature of all the fields in the document, from the first byte, up to but not including the "signature" keyword here. The signature is encoded in base64 with padding stripped.
4. The Blinding Values
The "Data" field of the blinding documents above, when decrypted, yields a sequence of 64-bit binary values, encoded in network (big-endian) order. There are C * R such values, where C is the number of keywords in the count document, and R is the number of instances that the Tally Reporter participates in. The client generates all of these values uniformly at random.
For each keyword in the count document, in the order specified by the count document, the decrypted data holds R*8 bytes for the specified instance of that keyword's blinded counter.
For example: if the count document lists the keywords "b", "x", "g", and "a" (in that order), and lists instances "0" and "2", then the decrypted data will hold the blinding values in this order:
   b, instance 0
   b, instance 2
   x, instance 0
   x, instance 2
   g, instance 0
   g, instance 2
   a, instance 0
   a, instance 2
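A non-normative Python sketch of parsing this layout, assuming the counters document has already been parsed into a keyword list and an instance list (struct's ">Q" format matches the big-endian 64-bit encoding above):

    import struct

    def unpack_blinding_values(data, keywords, instances):
        """Parse a decrypted Data field into {keyword: {instance: value}}.

        keywords: the keywords, in counters-document order.
        instances: the instance numbers this Tally Reporter handles, ascending.
        data: len(keywords) * len(instances) * 8 bytes of big-endian uint64s.
        """
        count = len(keywords) * len(instances)
        if len(data) != count * 8:
            raise ValueError("expected %d bytes, got %d" % (count * 8, len(data)))
        values = iter(struct.unpack(">%dQ" % count, data))
        return {kw: {inst: next(values) for inst in instances} for kw in keywords}

With the example above, unpack_blinding_values(data, ["b", "x", "g", "a"], [0, 2]) consumes the eight values in exactly the order listed.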
4.1. Implementation Notes
A relay should, when starting a new round, generate all the blinding values and noise values in advance. The relay should then use these values to compute Y_0 = SUM(B_i) + Z for each instance of each counter. Having done this, the relay MUST encrypt the blinding values to the public key of each tally reporter, and wipe them from memory.
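A rough sketch of that setup order, with hypothetical helpers for noise sampling and for the hybrid encryption below; for simplicity it assumes every Tally Reporter participates in every instance, and note that really wiping secrets from memory needs more care than Python's garbage collector provides:

    import secrets

    MASK = 2**64 - 1

    def start_round(keywords, instances, reporters, sample_noise, encrypt_to):
        """Sketch of per-round setup on a relay.

        sample_noise(keyword, instance) and encrypt_to(reporter, values) are
        hypothetical helpers for the noise sampler and the hybrid encryption
        step.  For simplicity, every Tally Reporter gets a value for every
        instance here.
        """
        counters = {}                      # (keyword, instance) -> Y_0
        blinding = {r: [] for r in reporters}
        for kw in keywords:
            for inst in instances:
                b = {r: secrets.randbits(64) for r in reporters}
                z = sample_noise(kw, inst)
                counters[(kw, inst)] = (sum(b.values()) + z) & MASK
                for r in reporters:
                    blinding[r].append(b[r])
        encrypted = {r: encrypt_to(r, blinding[r]) for r in reporters}
        # A real implementation must now wipe the plaintext blinding values;
        # in Python this is only a gesture, since copies may linger in memory.
        blinding.clear()
        return counters, encrypted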
5. The hybrid encryption algorithm
We use a hybrid encryption scheme above, where items can be encrypted to a public key. We instantiate it as follows, using curve25519 public keys.
To encrypt a plaintext M to a public key PK1:

   1. The sender generates a new ephemeral keypair sk2, PK2.

   2. The sender computes the shared Diffie-Hellman secret
      SEED = (sk2 * PK1).

   3. The sender derives 64 bytes of key material as
         SHAKE256(TEXT | SEED)[...64]
      where "TEXT" is "Expand curve25519 for privcount encryption".

      The first 32 bytes of this output are an AES key K1;
      the second 32 bytes are a MAC key K2.

   4. The sender computes a ciphertext C as AES256_CTR(K1, M).

   5. The sender computes a MAC as
         SHA3_256([00 00 00 00 00 00 00 20] | K2 | C).

   6. The hybrid-encrypted text is PK2 | MAC | C.
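A non-normative sender-side sketch in Python, using hashlib and the pyca/cryptography X25519 and AES primitives. The all-zero CTR counter block is an assumption (the text above does not specify one; K1 is single-use), as is the raw serialization chosen for PK2:

    import hashlib
    from cryptography.hazmat.primitives.asymmetric.x25519 import (
        X25519PrivateKey, X25519PublicKey)
    from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes
    from cryptography.hazmat.primitives.serialization import Encoding, PublicFormat

    TEXT = b"Expand curve25519 for privcount encryption"

    def hybrid_encrypt(pk1_bytes, plaintext):
        """Sender side of the hybrid encryption; returns PK2 | MAC | C."""
        pk1 = X25519PublicKey.from_public_bytes(pk1_bytes)
        # 1. Generate an ephemeral keypair sk2, PK2.
        sk2 = X25519PrivateKey.generate()
        pk2 = sk2.public_key().public_bytes(Encoding.Raw, PublicFormat.Raw)
        # 2. SEED = sk2 * PK1.
        seed = sk2.exchange(pk1)
        # 3. Derive K1 (AES key) and K2 (MAC key) from SHAKE256(TEXT | SEED).
        keys = hashlib.shake_256(TEXT + seed).digest(64)
        k1, k2 = keys[:32], keys[32:]
        # 4. C = AES256_CTR(K1, M).  Assumption: all-zero initial counter
        #    block, since K1 is never reused.
        enc = Cipher(algorithms.AES(k1), modes.CTR(b"\x00" * 16)).encryptor()
        c = enc.update(plaintext) + enc.finalize()
        # 5. MAC = SHA3_256([00 00 00 00 00 00 00 20] | K2 | C).
        mac = hashlib.sha3_256(b"\x00" * 7 + b"\x20" + k2 + c).digest()
        # 6. Output PK2 | MAC | C.
        return pk2 + mac + c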
6. An optimized alternative
As an alternative, the sequence of blinding values is NOT transmitted to the Tally Reporters. Instead, the client generates a single ephemeral keypair sk_c, PK_c, and places the public key in its counters document. It does this each time a new round begins.
For each tally reporter with public key PK_i, the client then does the handshake sk_c * PK_i to compute SEED_i.
The client then generates the blinding values for that tally reporter as SHAKE256(SEED_i)[...R*C*8].
After initializing the counters to Y_0, the client can discard the blinding values and sk_c.
Later, the tally reporters can reconstruct the blinding values as SHAKE256(sk_i * PK_c)[...]
This alternative allows the client to transmit only a single public key, when previously it would need to transmit a complete set of blinding factors for each tally reporter. Further, the alternative does away with the need for blinding documents altogether. It is, however, more sensitive to any defects in SHAKE256 than the design above. Like the rest of this design, it would need rethinking if we want to expand this scheme to work with anonymous data collectors, such as Tor clients.
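A rough sketch of this derivation; the big-endian unpacking is carried over from Section 4, and the helper and variable names are illustrative:

    import hashlib
    from cryptography.hazmat.primitives.asymmetric.x25519 import X25519PrivateKey

    def derive_blinding_values(shared_secret, num_counters, num_instances):
        """Expand one per-Tally-Reporter DH secret into C * R blinding values."""
        raw = hashlib.shake_256(shared_secret).digest(num_counters * num_instances * 8)
        return [int.from_bytes(raw[i:i + 8], "big") for i in range(0, len(raw), 8)]

    # Relay side: one ephemeral keypair per round; PK_c goes in the counters
    # document.  For each Tally Reporter public key pk_i (not shown here):
    #     seed_i = sk_c.exchange(pk_i)
    #     b_values_i = derive_blinding_values(seed_i, C, R)
    sk_c = X25519PrivateKey.generate()
    pk_c = sk_c.public_key()

    # Tally Reporter side, later:
    #     seed_i = sk_i.exchange(pk_c)    # the same shared secret
    #     derive_blinding_values(seed_i, C, R) reproduces the relay's values.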
[reposting this message with permission. It is a reply that I sent to Aaron, where I quoted an email from him about this proposal. Tim and Aaron had additional responses, which I'll let them quote here or not as they think best.]
On Sat, Aug 5, 2017 at 1:38 PM, Aaron Johnson aaron.m.johnson@nrl.navy.mil wrote: [...]
Aaron:

- There are a couple of documents in PrivCount that are missing: the deployment document and the configuration document. These set up things like the identities/public keys of the parties, the planned time of the measurements, the statistics to be computed, and the noise levels to use. These values must be agreed upon by all parties (in some cases, such as disagreement about noise, the security/privacy guarantees could otherwise fail). How do you plan to replace these?
Nick:

So, I hadn't planned to remove these documents, so much as to leave them out of scope for this proposal. Right now, in the code, there's no actual way to configure any of these things.
Thinking aloud:
I think we should engineer that piece by piece. We already have the consensus directory system as a way to communicate information that needs to be securely updated, and where everybody needs to update at once, so I'd like to reuse that to the extent that it's appropriate.
For some parts of it, I think we can use versions and named sets. For other parts, we want to be flexible, so that we can rotate keys frequently, react to tally reporters going offline, and so on. There may need to be more than one distribution mechanism for this metainfo.
These decisions will also be application-dependent: I've been thinking mainly of "always-on" applications, like network metrics, performance measurement, anomaly-detection [*], and so on. But I am probably under-engineering for "time-limited" applications like short-term research experiments.
Aaron:

- I believe that instead of dealing with Tally Reporter (TR) failures using multiple subsets, you could instead simply use (t,n) secret sharing, which would survive any t-1 failures (but also allow any subset of size t to determine the individual DC counts). The DC would create one blinding value B and then use Shamir secret sharing to send a share of B to each TR. To aggregate, each TR would first add together its shares, which would yield a share of the sum of the blinding values from all DCs. Then the TRs could simply reconstruct that sum publicly, which, when subtracted from the public, blinded, noisy counts would reveal the final noisy sum. This would be more efficient than having each TR publish multiple potential inputs to different subsets of TRs.
Nick:

So, I might have misunderstood the purpose here: I thought that the instances were to handle misbehaving DCs as well as malfunctioning TRs.
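A rough sketch of the (t,n) Shamir approach Aaron describes, working over a prime field slightly larger than 2^64; the prime, the helpers, and the aggregation flow shown are illustrative assumptions rather than anything specified for PrivCount:

    import secrets

    P = 2**64 + 13          # a prime slightly larger than 2^64 (assumption)

    def share(secret, t, n):
        """Split `secret` into n Shamir shares with reconstruction threshold t."""
        coeffs = [secret] + [secrets.randbelow(P) for _ in range(t - 1)]
        return [(x, sum(c * pow(x, k, P) for k, c in enumerate(coeffs)) % P)
                for x in range(1, n + 1)]

    def reconstruct(shares):
        """Lagrange-interpolate the shares at 0 to recover the shared value."""
        total = 0
        for x_j, y_j in shares:
            num, den = 1, 1
            for x_m, _ in shares:
                if x_m != x_j:
                    num = num * (-x_m) % P
                    den = den * (x_j - x_m) % P
            total = (total + y_j * num * pow(den, P - 2, P)) % P
        return total

    # Each DC shares its blinding value B; TR i sums the i-th shares it
    # receives from all DCs, giving a share of SUM(B).  Any t TRs can then
    # reconstruct SUM(B) and subtract it from the published noisy totals.
    b1, b2 = secrets.randbelow(2**63), secrets.randbelow(2**63)
    shares_1, shares_2 = share(b1, 3, 5), share(b2, 3, 5)
    summed = [(x, (y1 + y2) % P) for (x, y1), (_, y2) in zip(shares_1, shares_2)]
    assert reconstruct(summed[:3]) == (b1 + b2) % P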
Aaron:

- Storing at the DC the blinded values encrypted to the TRs seems to violate forward privacy in that if during the measurement the adversary compromises a DC and then later (even after the final release) compromises the key of a TR, the adversary could determine the state of the DC’s counter at the time of compromise. This also applies to the optimization in Sec. 6, where a shared secret is hashed to produce the blinding values.
Nick:

Well, the adversary would need to compromise the key of _every_ TR in at least one instance, or they couldn't recover the actual counters.
I guess we could, as in the original design (IIUC), send the encrypted blinding values (or public DH key in sec 6) immediately from the DC when it generates them, and then throw them away client-side. Now the adversary would need to break into all the TRs while they were holding these encrypted blinding values.
Or, almost equivalently, I think we could make the TR public encryption keys only get used for one round. That's good practice in general, and it's a direction I generally like.
And of course, DCs should use a forward-secure TLS for talking to the TRs, so that an eavesdropper doesn't learn anything.
[*] One anomaly detection mechanism I've been thinking of is to look at different "protocol-warn" log messages. These log messages indicate that some third party is not complying with the protocol. They're usually logged at info, since there's nothing an operator can do about them, but it would be good for us to get notification if some of them spike all of a sudden.
On 8 Aug 2017, at 03:50, Nick Mathewson nickm@torproject.org wrote:
[reposting this message with permission. It is a reply that I sent to Aaron, where I quoted an email from him about this proposal. Tim and Aaron had additional responses, which I'll let them quote here or not as they think best.]
[Re-posting this edited thread with permission. It's a conversation that continues on from the last re-post.]
(The quoted thread below is between Aaron, Nick, and Tim; each turn is labeled with its author.)

...
Aaron:

- I believe that instead of dealing with Tally Reporter (TR) failures using multiple subsets, you could instead simply use (t,n) secret sharing, which would survive any t-1 failures (but also allow any subset of size t to determine the individual DC counts). The DC would create one blinding value B and then use Shamir secret sharing to send a share of B to each TR. To aggregate, each TR would first add together its shares, which would yield a share of the sum of the blinding values from all DCs. Then the TRs could simply reconstruct that sum publicly, which, when subtracted from the public, blinded, noisy counts would reveal the final noisy sum. This would be more efficient than having each TR publish multiple potential inputs to different subsets of TRs.

Nick:

So, I might have misunderstood the purpose here: I thought that the instances were to handle misbehaving DCs as well as malfunctioning TRs.

Aaron:

The mechanism you described (having each DC report different encrypted counters for different subsets of TRs) doesn’t handle failed (i.e. crashed) DCs. To handle failed DCs in the scheme you describe (with the blinding values stored encrypted in a document), you can just have the TRs agree on which DCs succeeded at the end of the measurement and only use blinding values from those DCs. So you don’t need multiple TR subsets to handle failed DCs.

Tim:

Each *subset* of DCs reports to a subset of the TRs. This deals with malicious and outlying DC values, as well as failed DCs. And it deals with failed TRs as well.

Aaron:

This seems unnecessary and inefficient. DC failures can be handled by the TRs at the end. TR failures can be handled using Shamir secret sharing.
Aaron:

Also, I should mention the reasons that Rob and I didn’t mention fault tolerance in the PrivCount design:

- The TRs (aka the SKs) only need to be online long enough to receive their blinding values, add them, and send the sum out. Therefore a measurement can recover from a failed TR if its blinding values are persistently stored somewhere and if *at any point* the TR can be restarted.

Tim:

In the event of key compromise, or operator trust failure, or operator opt-out, the TR can never be restarted (securely).

Aaron:

If these are real concerns, then you should use Shamir secret sharing across the TRs. Honestly, they seem unlikely to me, and the cost of missing one round of statistics seems low. However, the cost of dealing with them is also low, and so you might as well do it!

Aaron:

- Handling DC failures is trivial. As mentioned above, the TRs simply wait until the end to determine which DCs succeeded and should have their blinding values included in the sum.

Tim:

How would you do this securely? Any scheme I think of allows a malicious TR to eliminate particular relays.

Aaron:

A malicious TR can in any case eliminate a particular relay by destroying the outputs of any subsets containing that relay. Destroying an output is done by using a random value as the blinding value, making the output random (and likely obviously so). The privacy comes from the differentially private noise, and because TRs won’t agree on subsets that would reduce the added noise below the desired amount, the adversary couldn’t break privacy by eliminating particular relays. Moreover, if you wanted, you could use a secure broadcast (e.g. the Dolev-Strong protocol) to enable the TRs to agree on the union of DCs that any one of the TRs received the counters documents from. Such a secure broadcast is used in PrivCount to get consensus on the deployment and configuration documents.
Aaron:

Also, one thing I forgot to mention in my last email is that you have removed the Tally Server, which is an untrusted entity that essentially acts as a public bulletin board. Without such a collection point, who obtains the outputs of the TRs and computes the final result?

Tim:

We'll work with Tor metrics to decide on a mechanism for taking the counts from each TR subset, and turning them into a final count.

This would probably be some kind of median, possibly discarding nonsensical values first.

Aaron:

If you plan to release multiple values from different DC subsets to handle nonsensical values, then you will have to increase the noise to handle the additional statistics. This can be done just as with handling DC failures: TRs agree on several DC subsets from among the DCs that didn’t fail and then release a blinding value sum for each subset. Note that DCs actually only need to send one set of blinding values and one set of counters to the TRs.
Aaron:

- Storing at the DC the blinded values encrypted to the TRs seems to violate forward privacy in that if during the measurement the adversary compromises a DC and then later (even after the final release) compromises the key of a TR, the adversary could determine the state of the DC’s counter at the time of compromise. This also applies to the optimization in Sec. 6, where a shared secret is hashed to produce the blinding values.

Nick:

Well, the adversary would need to compromise the key of _every_ TR in at least one instance, or they couldn't recover the actual counters.

Aaron:

That’s true.

Nick:

I guess we could, as in the original design (IIUC), send the encrypted blinding values (or public DH key in sec 6) immediately from the DC when it generates them, and then throw them away client-side. Now the adversary would need to break into all the TRs while they were holding these encrypted blinding values.

Aaron:

Right, that is the original design and would provide a bit more forward security than in the current spec.

Nick:

Or, almost equivalently, I think we could make the TR public encryption keys only get used for one round. That's good practice in general, and it's a direction I generally like.

Aaron:

That would work, too.

Nick:

[*] One anomaly detection mechanism I've been thinking of is to look at different "protocol-warn" log messages. These log messages indicate that some third party is not complying with the protocol. They're usually logged at info, since there's nothing an operator can do about them, but it would be good for us to get notification if some of them spike all of a sudden.

Aaron:

Really interesting idea! Rob and I are interested in looking for attacks on the Tor network using metrics as well. This kind of anomaly reminds me of the RELAY_EARLY attack that you wrote a detector for.
T

--
Tim Wilson-Brown (teor)
teor2345 at gmail dot com
PGP C855 6CED 5D90 A0C5 29F6 4D43 450C BA7F 968F 094B
ricochet:ekmygaiu4rzgsk6n
------------------------------------------------------------------------
Hi Tim,
unfortunately, nobody from the metrics team can attend today's proposal 280 discussion in a few hours.
That's why we decided to provide some written feedback here.
We didn't find anything problematic in the proposal from the view of Tor metrics.
This is due to the narrow scope covering only the communication protocol between tally servers and relays, as we understand it.
All topics related to deriving counts, calculating final results, and anything else that could affect currently running metrics code are explicitly excluded or not mentioned.
If we misunderstood the scope and there is actually a part that covers current or future metrics code, please let us know, and we'll check that again.
Thanks for working on privacy-preserving statistics in Tor!
All the best, Karsten
Hi Karsten and metrics,
On 13 Sep 2017, at 04:27, Karsten Loesing karsten@torproject.org wrote:
Hi Tim,
unfortunately, nobody from the metrics team can attend today's proposal 280 discussion in a few hours.
We turned on meetbot!
The meeting action items are:
• write a k-of-n secret sharing spec
• revise prop280 to use k-of-n secret sharing
• update the proposal to deal with post-submission shared-random-based relay subset selection
• increase the noise added in the spec for each subset of relays that produces a result
• specify how to estimate sensitivity and expected values for each counter, and how to turn that into a set of sigmas
• specify how to safely change the set of counters that is collected (or the noise on those counters) as new tor versions that support new counters are added to the network (and old versions leave)
• specify the privacy budget parameter that we need to turn into consensus parameters
• specify how to maintain privacy guarantees when the set of statistics changes, probably by reducing accuracy
Here is a log of the meeting: http://meetbot.debian.net/tor-dev/2017/tor-dev.2017-09-13-00.16.html
That's why we decided to provide some written feedback here.
We didn't find anything problematic in the proposal from the view of Tor metrics.
This is due to the narrow scope covering only the communication protocol between tally servers and relays, as we understand it.
All topics related to deriving counts, calculating final results, and anything else that could affect currently running metrics code are explicitly excluded or not mentioned.
We mentioned a few of these topics in the meeting.
In particular, we talked about splitting relays into multiple subsets for fault-tolerance. This would give us one result per counter per subset.
We'd appreciate your feedback on these parts of the meeting.
If we misunderstood the scope and there is actually a part that covers current or future metrics code, please let us know, and we'll check that again.
We plan to write these specs separately. We will also make updates to the current prop280 spec.
Thanks for working on privacy-preserving statistics in Tor!
Looking forward to working with you on this.
T

--
Tim Wilson-Brown (teor)
teor2345 at gmail dot com
PGP C855 6CED 5D90 A0C5 29F6 4D43 450C BA7F 968F 094B
ricochet:ekmygaiu4rzgsk6n
------------------------------------------------------------------------