Tools to generate extra metadata

bug1 · April 17, 2015, 3:15am

Mandatory metadata retention forces corporations to collect data that is of little use on its own. The goal is that by combining all this useless information, the government can make something useful out of it.

It's like a needle in a haystack: if they keep shovelling more hay onto the haystack, maybe they will accidentally add another needle, which they can then search for. They can't look for the needles before they put the hay on the haystack, because it's someone else’s job to define what a needle is, and that definition might change at a later date.

The problem for them is that the sorters are expensive, and every time they change their definition of what a needle is, they have to start again. An extreme example of this is the Utah Data Center, which sorts for needles as defined by the US government. It draws 65MW of electricity and uses millions of gallons of water per month. They sort a haystack of between 3 and 12 exabytes, which is between roughly 416MB and 1.67GB for every person on the planet (before compression).
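A quick check of the per-person arithmetic, as a rough sketch. The world population of about 7.2 billion is my assumption (roughly the 2015 figure), and I'm using decimal units (1 exabyte = 10^18 bytes). On those assumptions 3 exabytes comes to about 417MB each and 12 exabytes to about 1.67GB each.

```python
# Per-person share of the estimated Utah Data Center storage.
# WORLD_POPULATION is an assumption (roughly 2015), not a sourced figure.
WORLD_POPULATION = 7.2e9
EXABYTE = 10**18  # decimal units


def bytes_per_person(total_exabytes, population=WORLD_POPULATION):
    """Return the number of stored bytes per person for a given total."""
    return total_exabytes * EXABYTE / population


low = bytes_per_person(3)
high = bytes_per_person(12)
print(f"{low / 1e6:.0f} MB to {high / 1e9:.2f} GB per person")
# prints: 417 MB to 1.67 GB per person
```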

Government judges everything based on cost, and this new surveillance system will be expensive. Even if they push the collection and storage costs onto “service providers”, they still need to do the sorting themselves (and that's more expensive than storage).

The cost of sorting can be influenced by us, the users: the more metadata we generate, the more infeasible the whole operation becomes.

Maybe we should be looking to start or promote projects that generate lots of metadata per user?

One idea that comes to mind is a distributed search engine: end users slowly crawl the web in a coordinated way and pool the results. The person running the crawler gains privacy by adding noise to the system, and there is a useful result in the end. An independent search engine also gives control back to users, as big brother can force corporations to remove items from their index, but maybe not from a p2p system. (A p2p system could choose to remove items, but it would be more transparent.)

EDIT: Trokam looks interesting

Maybe there are more practical ways to generate metadata; I'm just throwing that out there.

Azza · April 17, 2015, 5:02am

I think that is a good idea. Is the intention behind this to share openly? I am wondering if there is a page somewhere that publicly lists consolidated tips, ideas and best practices, not for circumventing the law but for helping the average internet user with privacy etc. It would be just another way to get more traffic to the site, e.g. you could have one huge page, or pages with links to other sites that are Pirate Party approved.

I think the government will proceed with metadata retention regardless, but if the Australian government is smart it should just get the USA to do the dirty work, because they already have all the systems in place. Whatever additional data is required could be sent to them overseas, though not without its own risks.

I am interested in how much additional metadata one can generate with the search engine, though. How much extra data are we talking?

Dada · April 17, 2015, 6:16am

Surveillance self defence instructions from Electronic Frontiers

Surveillance self defence for those living under repressive regimes

bug1 · April 17, 2015, 7:17am

My rough calculation below says they might need to store 60 bytes per connection. A minimal IPv4 header is 20 bytes, and an ICMP header adds an extra 8 bytes (the rest depends on type). So for specifically crafted messages, it's possible that more metadata is stored than is sent (which is significant).

How much metadata a user can generate will be limited by the user's upload capacity and by the ratio of metadata recorded to data sent. It could become extremely burdensome to service providers, and make the system unusable by the spies.
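To put a very rough number on that, here is a sketch of the upper bound. It assumes a 1 Mbit/s upload link (my assumption), that every 28-byte packet (20-byte IPv4 header plus 8-byte ICMP header, no payload) is logged as its own 60-byte record, and it ignores link-layer overhead. Real retention would likely log per session rather than per packet, so treat this as a ceiling, not a prediction.

```python
RECORD_BYTES = 60                 # per-connection estimate from this post
PACKET_BYTES = 28                 # 20-byte IPv4 header + 8-byte ICMP header
UPLOAD_BITS_PER_SEC = 1_000_000   # assumption: 1 Mbit/s upload
SECONDS_PER_DAY = 86_400


def metadata_bytes_per_day(upload_bps,
                           packet_bytes=PACKET_BYTES,
                           record_bytes=RECORD_BYTES):
    """Upper bound on retained metadata if every packet becomes one record."""
    packets_per_sec = upload_bps / 8 / packet_bytes
    return packets_per_sec * record_bytes * SECONDS_PER_DAY


per_day = metadata_bytes_per_day(UPLOAD_BITS_PER_SEC)
print(f"{per_day / 1e9:.1f} GB of metadata per day")
# prints: 23.1 GB of metadata per day
```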

It would be better if our resources could be put to a more constructive purpose than just throwing a wrench in the gears, which is why a distributed web crawler would be good: it adds a (more) legitimate reason to take part.

I am not a lawyer, but it's probably important that the sites connected to are not chosen by the user, so there is no intent to go to unacceptable sites. Also, indexing a site doesn't necessarily need to store any content like images; it just needs to calculate the hash of the stream(s) of data on the fly.
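Hashing on the fly could look something like this minimal sketch. SHA-256 and the 64KB chunk size are my choices for illustration, and `hash_stream` is a made-up helper name. The point is only that each chunk is fed to the hash and then discarded, so the content itself is never kept.

```python
import hashlib
import io


def hash_stream(stream, chunk_size=65536):
    """Hash a file-like object chunk by chunk without keeping its content."""
    digest = hashlib.sha256()
    # read() returns b"" at end of stream, which stops the loop
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


# Stand-in for a downloaded page or image; any object with read(n) works.
sample = io.BytesIO(b"example page content")
print(hash_stream(sample))
```

Anything with a `read(n)` method should work in place of the `BytesIO` stand-in, so a crawler could pass in a response body as it arrives.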

Required data to be stored

1. Account details (eg account number): 2 bytes
2. Source of communications: IPv4 address, 4 bytes
3. Destination: eg IPv4 address, 4 bytes
4a. Start of communications: 9 bytes (assuming there is communications and it isn't a “hangup”)
- Date: 4 bytes
- Time (with sufficient accuracy to identify the communication): 4 bytes?
- Timezone: 1 byte
4b. End of communications: 9 bytes (if there is a start of communications)
4c. Start of connection: 9 bytes
4d. End of connection: 9 bytes (if it is connection-oriented, ie TCP rather than connectionless UDP)
5a. Type of communication (protocol): 2 bytes?
5b. Type of service (eg ADSL): probably stored per account, so no extra for each message
5c. Features of service (eg data volume): 4 bytes (depends on implementation)
6a. Location of equipment at start of communication: not sure it's relevant for internet, 4 bytes for GPS
6b. Location of equipment at end of communication: not sure it's relevant for internet, 4 bytes for GPS
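Adding up the estimates in the list above, as a sanity check on the 60-byte figure. The field names are just labels I made up for the list items, and the byte counts are the guesses from the list, not anything taken from the legislation.

```python
# Estimated bytes per retained record, one entry per item in the list.
FIELDS = {
    "account": 2,
    "source_ipv4": 4,
    "destination_ipv4": 4,
    "start_of_communication": 9,   # date 4 + time 4 + timezone 1
    "end_of_communication": 9,
    "start_of_connection": 9,
    "end_of_connection": 9,
    "protocol": 2,
    "features_of_service": 4,
    "location_at_start": 4,
    "location_at_end": 4,
}

RECORD_BYTES = sum(FIELDS.values())
ICMP_PACKET_BYTES = 20 + 8  # minimal IPv4 header + ICMP header, no payload
ratio = RECORD_BYTES / ICMP_PACKET_BYTES

print(RECORD_BYTES, ICMP_PACKET_BYTES, round(ratio, 2))
# prints: 60 28 2.14
```

So on these guesses, a minimal ICMP packet could cost a bit over twice its own size in stored metadata.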