Juha Nurmi juha.nurmi@ahmia.fi writes:
And what would you like to do over the summer so that: a) Something useful and concrete comes out of only 3 months of work. b) Your work will also be useful after the summer ends.
I would be interested to see some areas that you would like to work on over the summer, and how that would change the ahmia.fi user experience.
I have drafted a timetable for the possible new features to ahmia.fi:
https://docs.google.com/document/d/1XB42HM4uESYBAnoHHRuaqKMP64VFDI91Qa-CtIuy...
Hello Juha,
here are some comments on your proposal:
Search development

Full text search development
Popularity tracking (catch users' clicks and tell YaCy the popular pages): development of a popularity tracking feature for ahmia.fi and integration of that feature with the YaCy API (providing stats for popular pages and suggestions for relevant results)
1-3 workdays
Yes, this is definitely useful.
I would also like you to check out how backlinks work, and whether your crawler can start counting HS backlinks too. Mainly because popularity tracking is easily gameable, whereas backlinks might be harder to game (still definitely gameable though; SEO is crazy).
To make sure that this section is done properly, I would suggest compiling a list of well-known HSes and verifying that they all appear at or near the top of the ahmia search results by the end of development of these features.
I would suggest using more than 1-3 workdays for this.
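To make the popularity-tracking part concrete, here is a rough sketch of what the click-counting side could look like, assuming a Django view (ahmia already uses Django) and a hypothetical PopularityCounter model; how the counts get fed back into YaCy is deliberately left out, since the exact YaCy API call is not specified here:

```python
# Sketch: a click-counting redirect view for search results. Assumes Django
# (which ahmia already uses) and a hypothetical PopularityCounter model with
# `onion` and `clicks` fields; pushing the counts to YaCy is a separate step.
from django.db.models import F
from django.http import HttpResponseBadRequest, HttpResponseRedirect

from ahmia.models import PopularityCounter  # hypothetical model


def track_click(request):
    """Increment the click counter for a result, then redirect to it."""
    onion = request.GET.get('onion', '')
    target = request.GET.get('redirect_url', '')
    if not onion.endswith('.onion') or not target:
        return HttpResponseBadRequest('missing or malformed parameters')
    counter, _ = PopularityCounter.objects.get_or_create(onion=onion)
    counter.clicks = F('clicks') + 1  # atomic increment in the database
    counter.save(update_fields=['clicks'])
    return HttpResponseRedirect(target)
```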
Use another crawler to search for .onion pages from the public Internet
Search for new .onion domains from different online sources
This is an excellent case to test open source crawlers like Heritrix and Apache Nutch
1 workweek
Yes, this is very useful.
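Whichever crawler does the fetching, the ahmia-specific part is mostly pulling candidate .onion hostnames out of the crawled pages; a minimal sketch:

```python
# Sketch: pull candidate .onion hostnames out of crawled page text,
# regardless of which crawler (Heritrix, Nutch, ...) fetched the pages.
import re

# 16 base32 characters followed by ".onion" (the current HS address format).
ONION_RE = re.compile(r'\b([a-z2-7]{16})\.onion\b', re.IGNORECASE)


def extract_onions(page_text):
    """Return the set of .onion hostnames mentioned in a page."""
    return {match.lower() + '.onion' for match in ONION_RE.findall(page_text)}
```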
Public open YaCy back-end for everyone
Let's make our YaCy network open so anyone can join it with their YaCy nodes
This way we could get real P2P decentralization
Share an installation configuration package that joins a YaCy node to ahmia.fi's nodes
1 workweek
I guess this is also useful.
Better edited HS descriptions
Design and development of a more useful and complete UI, including more complete and exhaustive descriptions and details (e.g., show the whole history of descriptions and let users edit it better)
1 workweek
Yes this seems like a good idea. Improving the UX is very important.
Because of the security nature of ahmia, the UX should be security conscious too. For example, you shouldn't give your users too much confidence in the ordering of the search results, since a motivated adversary can probably influence it.
Maybe you could also expose some of your popularity/backlinks information to users, in case that lets them pick results more safely.
Comment and vote about the content (safe/unsafe)
Ahmia.fi needs commenting and rating systems for hidden services
It is useful to gather users' knowledge about the sites
1 workweek
I think that this needs more thinking.
The rating idea is trivially gameable. Do we assume that all users are good citizens?
Given that there are shitloads of phishing websites registered to ahmia, we take it that there are bad people out there who know of ahmia. How will the rating system interact with bad people? What about the commenting system? Is this also an argument against popularity tracking? How do we use these technologies usefully in the face of bad people?
Tor browser friendly version of ahmia.fi
Development of a JavaScript free version of ahmia.fi
1 workweek
TBB has javascript enabled these days. I would probably spend this one week on other stuff.
Search API 1 workweek
What do you mean by this? Do other search engines provide this sort of API?
This would need more than one week to design and deploy properly, no?
Automated statistics and visualizations about hidden services and their content
Development of an Analytics feature
As a result of indexing the Tor network's content, ahmia.fi can produce authoritative and exact quantitative research data about what is published through the Tor network.
2 workweeks
Automated visualizations
It is very practical to visualize the data
2 workweeks
Both of the above items are statistics and they seem to require 1 month of development. Are there really that many stats that we can/should produce?
What kind of stats are you thinking of, other than the "number of HSes added per month", "number of ahmia visitors", etc.? BTW, we should be very careful that stats are privacy preserving.
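As one example of a stat that stays aggregate and privacy preserving, "number of HSes added per month" can be computed from the existing database; a sketch, assuming a hypothetical HiddenWebsite model with an `added` timestamp:

```python
# Sketch: "number of HSes added per month" as an aggregate, privacy-preserving
# statistic. HiddenWebsite is a hypothetical model with an `added` timestamp;
# only counts are published, never per-visitor or per-query data.
from collections import Counter

from ahmia.models import HiddenWebsite  # hypothetical model


def hses_added_per_month():
    """Return {(year, month): count} for newly indexed hidden services."""
    months = Counter()
    for added in HiddenWebsite.objects.values_list('added', flat=True):
        months[(added.year, added.month)] += 1
    return dict(months)
```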
Show cached text versions of the pages 1 workweek
Useful. I thought you had this feature in the past though; no?
API development
In addition, ahmia.fi provides a RESTful API to integrate other services to use hidden service description information (see https://ahmia.fi/documentation/). Hidden services can integrate their descriptions directly into the hidden service list (see https://ahmia.fi/documentation/descriptionProposal/). Ahmia.fi knows which hidden services are online, and you can use the API to check a hidden service's online status. This API should be kept general and simple.
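For illustration, this is roughly how another service might consume the online-status part of the API; the endpoint path and the JSON field below are placeholders, the real interface is described in the documentation linked above:

```python
# Sketch of how another service might check a hidden service's online status
# through the REST API. The endpoint path and the 'online' JSON field are
# placeholders; see https://ahmia.fi/documentation/ for the real interface.
import requests


def onion_is_online(onion):
    """Ask ahmia.fi whether a given .onion is currently reachable."""
    url = 'https://ahmia.fi/address/%s/status' % onion  # placeholder path
    try:
        return requests.get(url, timeout=10).json().get('online', False)
    except (requests.RequestException, ValueError):
        return False
```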
Integration with software that uses hidden services
Integration with Tor2web
Thanks to our suggestion recently, Tor2web has implemented a feature that provides secure and anonymous statistics within a day. I want to implement an automatic fetch and handling of this data. Ahmia.fi should fetch these and add each new .onion page.
Child pornography is a plague for the Tor network, and a well designed and authoritative entity may be useful for providing some filtering lists. To this aim we are currently manually handling a filter list already integrated with Tor2web and in use on almost all the nodes of the Tor2web network (https://ahmia.fi/policy/, https://github.com/globaleaks/Tor2web-3.0/issues/25). In collaboration with Tor2web I want to develop an efficient and automated system to handle and share filtering information in a secure manner.
1 workweek
Hm, this is interesting but potentially controversial. Where is this data?
Development of a Content Abuse Signaling feature in order to allow fast handling of abuse comments; I want to implement a Callback API in order to publish this data to Tor2web nodes in real time.
1-3 workdays
Ehm, so you are going to expose all the banned pages to Tor2Web? Is this API going to be public? Will anyone be able to see the banned pages?
If it's not public, how are you going to protect it? Is this doable in 1-3 workdays? Is this worth doing?
Globaleaks integration
Currently, GlobaLeaks informs ahmia.fi to index new hidden services
Ahmia.fi could extend the visibility of GlobaLeaks in the search results
Together with GlobaLeaks: RESTful API according to GlobaLeaks' needs
1 workweek
So you will make an API that allows people to submit HSes to ahmia? Will this be usable by anyone; can it be exploited? If not, how will you protect it? Is this really worth doing?
Estimated amount of work is 13 weeks.
All in all, the timetable looks good.
I'm quite excited about the changes to your crawler (that will give us a bigger list of HSes), and the changes to your indexing (popularity tracking/backlinks etc.). I think you should devote more time to these so that they are done properly. You currently estimated 1.5 weeks for those tasks, but maybe you could bump it to 3 or 4 weeks. OTOH, I don't know much about search engines so it might be easier than I think.
I'm also excited about the UX changes and statistics, but I'm not sure if I would devote one month just for statistics. Maybe steal some time from statistics and give it to the crawler/indexing and UX? Maybe not?
The API stuff and the "Integration with *" projects are probably harder/riskier to do than they seem. Are we sure we want to do them? Better to do fewer things properly, than many things sloppily. Or not?
I would also like to see the code base cleaned up a bit: for example, a README file and some basic descriptions of what each file does. Probably also include the YaCy/crawler configs?
I would also like Ahmia to have some docs on the website. I would like to see a doc on how ahmia works, including how its components interact with each other. And I would also like to see a doc that explains to users the threat model of Ahmia; that is, what technologies ahmia has in place to defend against phishing, how likely they are to succeed, and how cautious users should be.
2014-03-16 20:13 GMT+01:00 George Kadianakis <desnacked@riseup.net>:
> API development
>
> In addition, ahmia.fi provides RESTful API to integrate other services to use hidden service description information (see https://ahmia.fi/documentation/). Hidden services can integrate their descriptions directly to the hidden service list (see https://ahmia.fi/documentation/descriptionProposal/).
I just manually added all the known GlobaLeaks sites (http://en.wikipedia.org/wiki/GlobaLeaks#GlobaLeaks_uses) to the Ahmia Directory, so we can test the experimental feature of GlobaLeaks that generates a Description.json, documented at https://ahmia.fi/documentation/descriptionProposal/
> Integration with Tor2web
> Thanks to our suggestion recently, Tor2web has implemented a feature that provides secure and anonymous statistics within a day. I want to implement an automatic fetch and handling of this data.
> Ahmia.fi should fetch these and add each new .onion page
The experimental statistics are documented here: https://github.com/globaleaks/Tor2web-3.0/wiki/OpenData
Fabio
Hi George,
I'll answer the two Tor2web related questions:

1) Where are the stats of Tor2web?
In collaboration with Ahmia I've decided to implement some stats, but since we don't want to provide (I) realtime data useful for tracking or (II) too much information, we decided to implement the following:

Tor2web logs, only in RAM and using a two-day window, a cumulative counter per hidden service, and makes only yesterday's stats available for access, in order not to provide this data in realtime. The data provided is simply:

hidden service 1: #22
hidden service 2: #333
...

Here is the example: https://antani.tor2web.org/antanistaticmap/stats/yesterday
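For what it's worth, fetching and parsing that endpoint is straightforward; a sketch, assuming the plain "hidden service: #count" text format shown above (if the endpoint actually serves JSON, only the parsing step changes):

```python
# Sketch: fetch yesterday's per-HS access counts from a Tor2web node and turn
# them into a dict. Assumes the plain "hidden service: #count" text format
# shown above.
import requests

STATS_URL = 'https://antani.tor2web.org/antanistaticmap/stats/yesterday'


def fetch_yesterday_stats(url=STATS_URL):
    """Return {hidden_service: access_count} for the previous day."""
    counts = {}
    for line in requests.get(url, timeout=30).text.splitlines():
        if ':' not in line:
            continue
        name, _, count = line.rpartition(':')
        try:
            counts[name.strip()] = int(count.strip().lstrip('#'))
        except ValueError:
            continue  # skip lines that do not follow the expected format
    return counts
```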
2) What about the list of blocked HSes?
Tor2web also provides the configuration for blocked HSes by simply using MD5s of resources, so that the URL of the blocked page is not available after configuration time. With this simple implementation we always check the MD5 of the hidden service and the MD5 of the full resource; if one matches, it means the page must be blocked.

Tor2web also currently fetches ahmia's list in order to get automatic updates of this hashed blocklist.

This is the list provided by ahmia: https://ahmia.fi/bannedMD5.txt (why does it currently list only one entry?)
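The check itself is simple; here is a sketch of the hostname-hash side, assuming bannedMD5.txt contains one hex MD5 digest per line and that the bare .onion hostname is what gets hashed (both assumptions, since the exact hashed string isn't specified here):

```python
# Sketch of the hashed-blocklist check: hash the requested .onion hostname
# with MD5 and compare against ahmia's published list. Assumes one hex MD5
# digest per line in bannedMD5.txt and that the bare hostname is what gets
# hashed; the same structure works for the full-resource hash.
import hashlib

import requests

BANNED_URL = 'https://ahmia.fi/bannedMD5.txt'


def load_banned_md5s(url=BANNED_URL):
    """Fetch the banned list as a set of lowercase hex digests."""
    text = requests.get(url, timeout=30).text
    return {line.strip().lower() for line in text.splitlines() if line.strip()}


def is_blocked(onion_hostname, banned_md5s):
    digest = hashlib.md5(onion_hostname.encode('utf-8')).hexdigest()
    return digest in banned_md5s
```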
George, if you or anybody else have good suggestions for improvements, you are welcome :)
Giovanni
The rating idea is trivially gameable. Do we assume that all users are good citizens?
Given experience with onionland, unless you are building your own review team, I too would be careful about allowing random user input or believing any given percentage of it to be good. There are already large networks out there dedicated to self-reinforcing their own bad intentions.
I'm quite excited about the changes to your crawler (that will give us a bigger list of HSes),
Probably also include the YaCy/crawler configs?
I'd like to compare them to what I might deploy on my YaCy, and more importantly, to test YaCy's capability to reach the top live onion counts I'm finding with custom crawling. I'd like to chat with ahmia and more crawlers in the future as my backend comes together better.
From seeing a prototype already, I'd second looking at nutch and/or some nosql (now maybe nutch v2) as well long term.
I also suggest tor2web et al review the wisdom in building more services on top of the basic gatewaying that exists. At some point you need to be moving people and yourself to simply run the Tor client, not to build comfy all in one clearnet home so they just be lazy and suck your gateway nipple forever.
Publishing these types of lists also does the good work of helping seed crawlers with new onions. Note that you should include the '.onion' suffixes so that all crawler parsers can recognize and extract them without having to parse you specifically.
if somebody wants to see the stats example please use this link: https://antani.onion.to/antanistaticmap/stats/yesterday
in fact not all the nodes in the tor2web.org round robin implement it.
Giovanni
Hi,
Thank you George, Fabio and Giovanni! :)
I gathered these comments to the Google Docs: https://docs.google.com/document/d/1XB42HM4uESYBAnoHHRuaqKMP64VFDI91Qa-CtIuy...
I have written a comment on each comment.
Furthermore, I modified the application:
- explained and redefined some tasks
- removed tasks "JavaScript free version of ahmia.fi" and "Search API"
- rescheduled more time to the search development (+2 weeks) and took the time from the visual and statistical data publishing part (-2 weeks)
- total estimated amount of work is 13 weeks
I am writing a scientific paper about ahmia.fi right now and planning to get it ready during this month. It will be important documentation for us.
What do you think George?
Greetings Juha
On Mon, Mar 17, 2014 at 10:18 AM, Giovanni `evilaliv3` Pellerano < giovanni.pellerano@evilaliv3.org> wrote:
if somebody wants to see the stats example please use this link: https://antani.onion.to/antanistaticmap/stats/yesterday
in fact not all the nodes in the tor2web.org round robin implement it.
Giovanni
"Nurmi, Juha" juha.nurmi@ahmia.fi writes:
Hi,
Thank you George, Fabio and Giovanni! :)
I gathered these comments to the Google Docs: https://docs.google.com/document/d/1XB42HM4uESYBAnoHHRuaqKMP64VFDI91Qa-CtIuy...
I have written a comment on each comment.
Furthermore, I modified the application:
- explained and redefined some tasks
- removed tasks "JavaScript free version of ahmia.fi" and "Search API"
- rescheduled more time to the search development (+2 weeks) and took the
time from the visual and statistical data publishing part (-2 weeks)
- total estimated amount of work is 13 weeks
I am writing a scientific paper about ahmia.fi right now and planning to get it ready during this month. It will be important documentation for us.
What do you think George?
Looks good, I think.
But now that you don't have a "Search API" project, what are you going to do during the Globaleaks integration?
Also, are you sure that 1-3 workdays are sufficient to design & implement a banned domain synchronizer between tor2web and ahmia?
BTW, you are supposed to do your application in Google Melange, not in this mailing list (although I'm happy you posted your app here so that more people can comment on it!). The website is: https://www.google-melange.com/gsoc/homepage/google/gsoc2014 The deadline is in 4 days or so, I think.
PS: Nitpick, but I should exercise my allergy to broken crypto and suggest switching to a better hash algorithm (SHA256 or so) instead of MD5 for passing banned domain names around.
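The change on the producing side is tiny; a sketch (migrating the existing MD5-only entries is a separate question):

```python
# Sketch: producing SHA-256 entries for the banned list instead of MD5.
# Migrating the existing MD5-only entries is a separate problem, since the
# cleartext domains behind them may no longer be known.
import hashlib


def banned_entry(onion_hostname):
    """Return the hex digest to publish on the banned list."""
    return hashlib.sha256(onion_hostname.encode('utf-8')).hexdigest()
```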
On 17.03.2014 15:17, George Kadianakis wrote:
But now that you don't have a "Search API" project, what are you going to do during the Globaleaks integration?
The search API was supposed to be a query API to ahmia's database. However, this is not a relevant feature at the moment.
Also, are you sure that 1-3 workdays are sufficient to design & implement a banned domain synchronizer between tor2web and ahmia?
Well, I cannot know that. Let's put one workweek for that. I am hoping to spend a workday or two with Tor2web and we get it done.
BTW, you are supposed to do your application in Google Melange, not in this mailing list (although I'm happy you posted your app here so that more people can comment on it!). The website is: https://www.google-melange.com/gsoc/homepage/google/gsoc2014 The deadline is in 4 days or so, I think.
Sure!
PS: Nitpick, but I should exercise my allergy to broken crypto and suggest switching to a better hash algorithm (SHA256 or so) instead of MD5 for passing banned domain names around.
The reason we are publishing MD5 sums of the banned domains is that in some countries it is illegal to own or host a list of CP URLs. Anyway, if someone is looking for CP .onions he will find them... However, I do not see any reason why we couldn't, together with Tor2web, change to a better hash algorithm.
-Juha
Juha Nurmi juha.nurmi@ahmia.fi writes:
On 17.03.2014 15:17, George Kadianakis wrote:
But now that you don't have a "Search API" project, what are you going to do during the Globaleaks integration?
The search API was supposed to be a query API to ahmia's database. However, this is not a relevant feature at the moment.
Also, are you sure that 1-3 workdays are sufficient to design & implement a banned domain synchronizer between tor2web and ahmia?
Well, I cannot know that. Let's put one workweek for that. I am hoping to spend a workday or two with Tor2web and we get it done.
How is ahmia going to communicate with tor2web? Will the connection be authenticated? How will you block bad people from adding their own stuff to your blacklist?
Also, are you sure that 1-3 workdays are sufficient to design & implement a banned domain synchronizer between tor2web and ahmia?
Well, I cannot know that. Let's put one workweek for that. I am hoping to spend a workday or two with Tor2web and we get it done.
How is ahmia going to communicate with tor2web? Will the connection be authenticated? How will you block bad people from adding their own stuff to your blacklist?
One way to solve this is to download a list of working Tor2web nodes from GitHub. These nodes are added to GitHub manually. After that I can download the information from the nodes every day. On the other hand, the Tor2web software can download the list of banned domains from ahmia.fi. This is one easy way to handle the information exchange.
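Roughly like this, then; a sketch of the daily pull, where the location of the node list on GitHub is only a placeholder:

```python
# Sketch of the pull-based exchange described above: read a manually
# maintained list of Tor2web nodes and fetch yesterday's stats from each.
# NODES_URL is only a placeholder for wherever that list ends up on GitHub.
import requests

NODES_URL = 'https://raw.githubusercontent.com/globaleaks/tor2web-nodes/master/nodes.txt'  # placeholder


def fetch_all_node_stats():
    """Return {node_hostname: raw stats text} for every reachable node."""
    nodes = requests.get(NODES_URL, timeout=30).text.split()
    stats = {}
    for node in nodes:
        url = 'https://%s/antanistaticmap/stats/yesterday' % node
        try:
            stats[node] = requests.get(url, timeout=30).text
        except requests.RequestException:
            continue  # node down, or it does not implement the stats endpoint
    return stats
```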
Juha Nurmi juha.nurmi@ahmia.fi writes:
Also, are you sure that 1-3 workdays are sufficient to design & implement a banned domain synchronizer between tor2web and ahmia?
Well, I cannot know that. Let's put one workweek for that. I am hoping to spend a workday or two with Tor2web and we get it done.
How is ahmia going to communicate with tor2web? Will the connection be authenticated? How will you block bad people from adding their own stuff to your blacklist?
One way to solve this is to download a list of working Tor2web nodes from GitHub. These nodes are added to GitHub manually. After that I can download the information from the nodes every day. On the other hand, the Tor2web software can download the list of banned domains from ahmia.fi. This is one easy way to handle the information exchange.
It seems to me that with this architecture the list of banned hosts is exposed to anyone on the Internet.
I'm not sure if this is a good thing or a bad thing and it's not up to me to decide, but just wanted to point it out.
I do not see any particular risk in exposing the hashed list. The reason behind the hashed list is exactly that we want to allow publishing without any risk of publishing direct links to child porn content or other shit. Anyhow, I'm really interested in others' opinions.
I agree on the SHA-256; the reason why we have not changed it so far is that we don't want to lose the current banned list. In fact, no one is keeping the cleartext version of the banned links.
Giovanni
"Giovanni `evilaliv3` Pellerano" giovanni.pellerano@evilaliv3.org writes:
I do not see any particular risk in exposing the hashed list. The reason behind the hashed list is exactly that we want to allow publishing without any risk of publishing direct links to child porn content or other shit. Anyhow, I'm really interested in others' opinions.
Ah, I just realized that you are publishing the hashed list.
Sorry, I was confused.
"Nurmi, Juha" juha.nurmi@ahmia.fi writes:
Hi,
Thank you George, Fabio and Giovanni! :)
I gathered these comments to the Google Docs: https://docs.google.com/document/d/1XB42HM4uESYBAnoHHRuaqKMP64VFDI91Qa-CtIuy...
I have written a comment on each comment.
Furthermore, I modified the application:
- explained and redefined some tasks
- removed tasks "JavaScript free version of ahmia.fi" and "Search API"
- rescheduled more time to the search development (+2 weeks) and took the
time from the visual and statistical data publishing part (-2 weeks)
- total estimated amount of work is 13 weeks
I am writing a scientific paper about ahmia.fi right now and planning to get it ready during this month. It will be important documentation for us.
What do you think George?
Another thing that ahmia could develop during GSoC is improving its detection of whether an HS is down. It seems that many of the results in ahmia are actually dead HSes, and there are also many results tagged as 'DEAD'.
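A basic liveness probe through a local Tor client would already help here; a sketch, assuming Tor listening on 127.0.0.1:9050 and a requests build with SOCKS support (a single failed probe shouldn't mark an HS as dead, so something like this would run repeatedly over a few days):

```python
# Sketch: a basic liveness probe for a hidden service, fetched through a local
# Tor SOCKS proxy. Assumes Tor on 127.0.0.1:9050 and requests[socks] installed.
import requests

TOR_PROXY = {'http': 'socks5h://127.0.0.1:9050',
             'https': 'socks5h://127.0.0.1:9050'}


def hs_is_up(onion_hostname, timeout=60):
    """Return True if the hidden service answers an HTTP request at all."""
    try:
        response = requests.get('http://%s/' % onion_hostname,
                                proxies=TOR_PROXY, timeout=timeout)
        return response.status_code < 500
    except requests.RequestException:
        return False
```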
Replying to some new additions in the proposal:
Thanks asn!

"Ask help from organizations that are crawling"
Today I emailed duckduckgo and asked whether there is an easy way to search for new .onions using their search engine.

"Checking out the backlinks from public WWW"
With a known onion address it is possible to find the popularity of an address by checking the number of search results:
https://duckduckgo.com/?q=%22http%3A%2F%2Fjlve2y45zacpbz6s.onion%22
and https://www.google.com/#q=%22http:%2F%2Fjlve2y45zacpbz6s.onion%22
and https://www.google.com/#q=link:http:%2F%2Fjlve2y45zacpbz6s.onion
This way I will get a list that tells the popularity according to links from the public WWW:

onion address & number of WWW sites that are linking to it
xyz.onion 123
abc.onion 90
uio.onion 24
mre.onion 17

Today I asked YaCy's developer how I could use this information.

"Commenting features"
I agree that commenting might be a mouth of madness because people might write just some random crap there. Technically this would be developed in the Django framework. Note that the priority of this task is low (10). We could decide to leave this commenting feature to the very last task or skip it.
ACK wrt commenting.
As far as backlinks are concerned, while I appreciate how rapid and easy your solution is, you might want to make it a bit more robust.
The way you did it, you treat the 123 references to 'xyz.onion', as strictly better than the 90 references to 'abc.onion'. This is not the case in the real web, since the 123 references to 'xyz.onion' might be SEO and they might be coming from xyz.onion itself or related websites.
Proper search engines assign weights to each backlink, according to how legit the search engine believes the linker to be. This has to do with how many backlinks the linker has, how legit the HTML content of the linker looks, etc. You can find more heuristics that search engines use by skimming an SEO book or an SEO forum.
It's up to you how deep you want to go into backlinking during GSoC, but IMO backlinking is a more reliable heuristic than popularity tracking. Up to you anyway!
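As a starting point, even a crude weighting scheme is better than flat counting; here is a sketch that ignores self-links and damps repeated links from the same site (a real implementation would also weight each linker by its own reputation, as described above):

```python
# Sketch: a crude weighted backlink score. Each linking site contributes a
# damped weight instead of a flat +1, and links from the target itself (or
# obvious mirrors) are ignored. A real implementation would also weight each
# linker by its own reputation.
import math
from collections import defaultdict


def backlink_scores(backlinks):
    """backlinks: iterable of (linking_site, target_onion) pairs."""
    per_target = defaultdict(lambda: defaultdict(int))
    for linker, target in backlinks:
        per_target[target][linker] += 1

    scores = {}
    for target, linkers in per_target.items():
        score = 0.0
        for linker, n_links in linkers.items():
            if target in linker:
                continue  # ignore self-links and mirrors of the target
            score += math.log(1 + n_links)  # damp repeated links from one site
        scores[target] = score
    return scores
```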
On 24.03.2014 13:57, George Kadianakis wrote:
Proper search engines assign weights to each backlink, according to how legit the search engine believes the linker to be.
It's up to you how deep you want to go into backlinking during GSoC, but IMO backlinking is a more reliable heuristic than popularity tracking. Up to you anyway!
We could test the reliability of the linkers too. As you said, there are multiple methods to do this. Because the number of .onions and linkers is relatively small, we can analyze the linking sites too. Usually there are fewer than 10 sites linking to a .onion site.
-Juha