As part of an independent study course at the University of Washington, I spent time trying to improve the usability of Tor. Tor is a network that forwards internet traffic between several computers to hide who is accessing a site and which sites a specific user is accessing. My ideas on how to improve the network did not pan out, but my testing raises some interesting questions that are best shared in an online format.
In 2009, searching Google for “tiananmen square protest” in China would yield very different results than the same search in any western nation:
This is due to China’s heavy-handed censorship. In 2009, when the above comparison was generated, Google cooperated with the Chinese authorities in censoring its search results, as is expected of any technology company in China. Much like parental software attempts to block pornography from children, China also blocks web-pages based on keywords and website blacklists. Major swaths of the internet are either directly blocked (Google search, Flickr, Facebook, Twitter, YouTube, etc) or intermittently blocked or slowed down (Gmail, blogger, Wikipedia, etc) depending on the political atmosphere.
Businesses often need access to these vital information stores, and it is accepted practice for Chinese professionals to subscribe to a proxy or VPN service to route their Internet connection outside of China. Then, China’s “Great Firewall” only sees an encrypted connection to an unknown computer outside of China. It’s a lot like money laundering, where all of the money is fed through a seemingly legitimate middleman, but instead of money it’s data. This works fairly well, and there are plenty of business people and academics who pay for this kind of service.
The problem with this approach is that the middleman knows who you are, what websites you are visiting, and what information you are putting into those websites. For most business people in China, this is fine, as their transactions are relatively benign business deals. But few can afford to pay a middleman for all this work. And, if you are a human rights activist researching non-violent activism, you could get locked up, beaten, tortured, or killed if the VPN provider’s network doesn’t provide good enough anonymity.
Tor is much like a proxy or VPN, but instead of one server proxying the data, Tor volunteers run multiple proxies (called relay nodes), which mix encrypted traffic between them. This creates a free network where it’s hard to identify where traffic is coming from and going to. However, being run by volunteers on their home internet means that Tor is much slower than a typical VPN, and page-load times can average 8* seconds or more. This makes Tor a workable solution for the hardcore activist, who needs perfect security and anonymity, but not for an average person who is casually interested in the truth about the Tiananmen Square protests. If one is trying to build the critical mass of informed citizens needed for governmental reform, Tor is useless.
The Tor engineers have largely attacked the speed issue by tweaking routing algorithms, encouraging more volunteers to run nodes, and directly funding more exit nodes. I wanted to know how large an impact prioritizing certain types of traffic would make, and to experiment with ways to encourage businesses to connect their servers to, and financially support, Tor directly.
It is a truism among server administrators that a large amount of traffic coming from Tor exit nodes is spam. Being a free, anonymizing network, Tor attracts spammers, scammers, file-sharing users, pedophiles and others who wish to hide from law enforcement. Information within the Tor network is encrypted, meaning that no one can read the information. However, the whole point of Tor is to allow connections to the regular Internet. Special nodes (called exit nodes) get a decrypted, readable version of the information before relaying it onto the regular Internet. While there have been efforts to block file-sharing traffic, Tor exit nodes could analyze outgoing traffic and block or de-prioritize traffic that appeared to be spammy.
To estimate the portion of exit traffic that these undesirables were taking up, I contacted researchers who had performed traffic analysis on the Tor network previously. Both groups had deleted their logs. Setting up an exit node and sniffing Tor exit traffic turned out to be not only time consuming, costly, and a bit beyond my technical capabilities, but legally dubious as well. I tried contacting Akismet and Cloudflare, two medium-large sized companies with big data sets that detect and filter spam bots. I had hoped they could provide the data needed to determine the ratio of spam-bot/human traffic coming from Tor exit proxies. Cloudflare was unable/unwilling to devote the manpower; Akismet never bothered to respond.
I contacted Michael Hampton, the administrator behind the FOSS spam bot detection software Bad Behavior, asking for a referral to any high-volume website administrators that might be willing to scan their logs. He actually went about setting up a large scale data collection mechanism for Bad Behavior, but he eventually got so bogged down that he dropped the project. I very much appreciate his attempt to help me, and I hope he can get around to it someday.
For now, however, prioritizing traffic based on bot detection signatures will remain an unknown quantity. Furthermore, sniffing outgoing traffic (even if it is automated) tends to run against the ethical grain of the administrators who feel compelled to donate time and money to providing a free proxy service. There are also other academics looking into Quality of Service tagging and bandwidth sharing incentive schemes.**
Reducing Hops by Moving Sites Into Tor Network
In late September of 2010, the privacy preserving search engine DuckDuckGo turned their main internet server into a Tor network node, making DuckDuckGo an anonymous search engine. DDG became an enclave node, a node that would be used for all DuckDuckGo connections in the Tor network instead of regular exit nodes. Since it should skip the exit relay, as the Tor exit enclave documentation led me to believe, there would be one fewer hop. Theoretically, the DDG website would load 30% faster.
I was interested in trying to promote enclave nodes for two reasons. The first was that enclave nodes can connect to regular relay nodes, skipping exit relays. Exit relays are difficult to maintain, as they appear to be the source of spam, hacking, file-sharing, and other unpleasant network traffic. Exit node operators tend to receive a lot of complaints from the RIAA and even an occasional email from law enforcement. Although rare, a few exit node operators have had their computers confiscated as evidence, and some have even been arrested in child porn raids. Operators are now very proactive at alerting others that they are running a Tor node, and are not the source of the traffic. This means that there are fewer exit nodes than regular relay nodes, which reduces anonymity and led me to assume that exit nodes were a major choke-point on the network.
Considering how the top ten sites account for 20% of the top 1000 sites’ traffic, it’s hard not to question the impact enclave nodes could have on the Tor network. Last year, Google stopped censoring its search results, and China started blocking Google. Yahoo and Microsoft voluntarily filter their own search results, but China has been jamming or slowing down services even with the foreign companies that “play ball” in order to promote homegrown Chinese companies.
According to a 2010 study which categorized the destination websites of headers from captured exit node packets, some 24% of HTTP traffic was going to search engines, social networking sites, and blogs. Some quick Googling revealed that of the top ten websites, six are regularly censored in China (YouTube, Yahoo, Live.com, Wikipedia.org, msn.com, BlogSpot, Baidu.com, Bing.com).* Even if those six sites only represent (at max) 10% of the overall exit node bandwidth, the vast majority of sites that Tor’s target audience visits stand to gain big improvements in page load time.
The second benefit to this approach is that, if load times became bearable** (which is a really hard metric to pinpoint, as it varies according to task and relative connection speeds), Tor could provide Google, Facebook, and others access to the Chinese market of regular internet users. If Tor became a viable option for reaching these normally benign services, and Tor usage became widespread for non-dissident activities (like business use), China might just have to swallow the presence of Tor on its network. Most importantly, however, Google and others might actively contribute to making Tor faster and safer in exchange for access to the massive Chinese marketplace.
The testing consisted of using a Tor-enabled version of Firefox to issue DuckDuckGo a search query of “my IP address” and comparing load times between an enclave and a non-enclave node server. Non-Tor queries and the use of DuckDuckGo’s hidden .onion search portal served as controls to isolate network variance unrelated to the Tor network.
During this past quarter, I was exploring end-user-programming as well. The social sciences tend to attract those who are not naturally inclined to program. This results in undergrads slaving away on boring grunt work that could be automated. I decided to field test how much of the test-bed could be automated with little or no programming.
My Mac Mini file server served as the host machine for a VirtualBox Lubuntu instance loaded with the portable version of Firefox running the Tor browser suite. To ensure local internet usage did not dirty the results, VirtualBox was connected to a dedicated connection, a DSL line installed for the purpose of this testing. Firefox’s page caching was disabled and the Firebug add-on provided load and rendering times.
I attempted to use the iMacros and Selenium automation suites for Firefox, as both provide a “watch-me-do” interface for demonstrative programming. However, both provided limited programmability of elements outside of the web-page. Copying load times from the Firebug add-on and turning Tor on and off proved to be impossible.
Up until this point no programming was needed. A researcher could fully automate almost any web-based task without any programming language or background. The instructions for setting up such a system could be written in an entirely procedural manner, requiring no understanding of the underlying system or abstract data structures.
In a compromise to my “no programming” design path, I chose to use the Sikuli automation IDE. Sikuli uses computer vision to identify interface elements rather than relying on explicitly labeled interface components, as iMacros, Selenium, AppleScript, and other interface automation suites do. However, Sikuli relies on lightweight Python scripting, and functional scripts can be created using the pattern-match behavior typical of end-user programmers.
Up to this point, I had prototyped each piece of the automated test-bed on my MacBook laptop. This took a significant amount of time and effort, and I wisely decided to do some pilot testing and post the set-up on the Tor mailing list before dedicating more time to assembling and debugging the test-rig on the server machine.
The first sign of trouble came when Roger Dingledine tested the DDG website against Tor’s exit node list. DDG was not listed. It turned out that DDG had made some changes which disabled the enclave proxy.
After working with the DDG engineers to fix the configuration issue, I realized I needed a way to reliably access the DDG server through a Tor exit relay and not the DDG exit enclave. The Tor DDG engineers graciously provided a CNAME for the search engine which did not reside on the same IP address as the exit enclave.
After installing LoadUI and a Tor proxy on my server, I began measuring download and response times between the exit node and exit enclave sites. After a few hours, there was no difference. The exit node circuit should have contained an additional hop compared to the enclave node. Additionally, the enclave node circuit only used the regular relay nodes in each hop, which were faster and more abundant than the exit nodes.
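The measurements themselves were taken with LoadUI, but the idea is simple enough to sketch in a few lines of Python. This is only an illustration under stated assumptions, not the LoadUI configuration I used: `fetch` is a hypothetical zero-argument callable standing in for "request the DDG page over one particular circuit type", and `time_calls` is a name made up for this sketch.

```python
import time


def time_calls(fetch, repeats):
    """Call fetch() `repeats` times and return each call's duration in seconds.

    `fetch` is a hypothetical stand-in for one page request (for example,
    a request routed through a local Tor proxy); it is not part of LoadUI.
    """
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()  # monotonic, high-resolution clock
        fetch()
        durations.append(time.perf_counter() - start)
    return durations
```

Running the same function once per condition, with a different `fetch` each time, gives one list of response times per condition to compare.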
Given that the enclave node had 1/3 fewer hops and that it wasn’t restricted by the exit node bandwidth, a difference of thirty percent between the exit and enclave DDG searches was a realistic expectation.
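The arithmetic behind that expectation, as a minimal sketch. It assumes response time scales linearly with the number of hops, which is my simplification and not something Tor guarantees; `expected_speedup` is a name made up for this example.

```python
def expected_speedup(hops_before, hops_after):
    """Fractional reduction in response time, ASSUMING time is proportional to hop count."""
    if hops_before <= 0 or not 0 <= hops_after <= hops_before:
        raise ValueError("need hops_before > 0 and 0 <= hops_after <= hops_before")
    return (hops_before - hops_after) / hops_before


# Going from a 3-hop circuit to a 2-hop circuit removes one third of the hops.
print(f"{expected_speedup(3, 2):.0%}")  # prints "33%"
```

Under that assumption, one fewer hop out of three works out to roughly the thirty percent I was expecting.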
But even after days of testing, there was no difference in response times from the exit and enclave circuits. LoadUI does not download pages like a normal web-browser, so I had expected to see less variance compared to browser load times. However, there was no discernible difference at all.
Thinking that I may have misconfigured something, I set up testing for all four DDG conditions (exit node, enclave node, .onion site, and non-Tor control). The non-Tor search would only provide a baseline response time, not any differences within the Tor network. The .onion search (having twice the number of hops) should have twice the response time of the regular exit node DDG search. Indeed, the .onion search took twice as long as the normal DDG page loads. Something was wrong, and it wasn’t with my test set-up.
False Assumption #1: an enclave node replaces the final relay node
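The sanity check across conditions boils down to comparing each condition against the regular exit node search. A minimal sketch of that comparison follows; the function name and the dictionary layout are assumptions made for this example, and it uses the median rather than whatever statistic LoadUI reports.

```python
from statistics import median


def relative_to_baseline(samples, baseline):
    """Median response time per condition, divided by the baseline condition's median.

    `samples` maps a condition name to a list of response times in seconds.
    """
    base = median(samples[baseline])
    return {name: median(times) / base for name, times in samples.items()}


# Placeholder timings in seconds, invented purely to show the call;
# they are NOT measurements from this experiment.
example = {"exit": [1.0, 2.0, 3.0], "onion": [4.0, 4.0, 5.0]}
print(relative_to_baseline(example, "exit"))  # {'exit': 1.0, 'onion': 2.0}
```

A ratio near 1.0 for the enclave condition and near 2.0 for the .onion condition is the pattern described above.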
I logged onto the Tor IRC chat channel and received an explanation from Sebastian. Despite what the documentation indicated, enclave node circuits do not contain fewer hops, as depicted below.
Connections with an enclave node provide two benefits:
- Enclave nodes do not rely on exit nodes, but instead pull from any regular relay node.
- No volunteer node can read the information being passed to the enclave node.
The reasoning for not having one fewer hop is somewhat complicated, and I was in the good company of MIT grads, Slashdot, and others in making this mistake. Since the Tor network is made up of volunteer nodes, an adversary could create hostile nodes on the network and piece together enough information to find the dissident.
There are two pieces of information that cannot be connected to the dissident:
- What information the dissident is sending
- Where the user is sending that information.
If an enclave node were constructed like we had assumed (with one fewer node/hop than a regular circuit), a hostile node could connect where the information is going to who sent that information.
The red, hostile node cannot read the information being passed to the enclave node, but the hostile node does know that the information it is passing is meant for the enclave node only. That information won’t be passed along any further. The hostile node also knows that the orange node knows who the information came from. To explain why takes a lot of background information, so I stuffed it in a skippable info box.
Instead, each user keeps a short list of entry nodes and they reuse those entry nodes for every session. Compared to the continually random entry node assignment, over a long period of time, it is less likely that the entry node picked will be a hostile node. This has the downside of making clients of a single entry node much easier to decipher, as they use the same subset of entry nodes over and over again.
In a 3 hop circuit, the middle node doesn’t know where the information originally came from or where its final destination is. An adversary has to compromise two nodes at any given time in order to connect the user and the destination. A hostile node at the end can’t see past the green node, as it carries multiple users’ traffic.
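The "compromise two nodes" claim for a 3 hop circuit can be illustrated with a toy model. This is a deliberately simplified sketch and not how Tor is implemented: it assumes each relay learns only its immediate neighbours on the circuit, and it ignores timing correlation and the entry node reuse described above. The node labels and function names are made up for the example.

```python
def observed_neighbours(circuit, hostile):
    """Return every party that sits directly next to a hostile relay.

    `circuit` lists the parties in order, user first and destination last.
    Only the relays in between can be hostile in this toy model.
    """
    seen = set()
    for i in range(1, len(circuit) - 1):
        if circuit[i] in hostile:
            seen.add(circuit[i - 1])  # who handed the traffic to this relay
            seen.add(circuit[i + 1])  # who this relay hands the traffic to
    return seen


def can_link(circuit, hostile):
    """True if the hostile relays together see both the user and the destination."""
    seen = observed_neighbours(circuit, hostile)
    return circuit[0] in seen and circuit[-1] in seen


three_hop = ["user", "entry", "middle", "exit", "destination"]
print(can_link(three_hop, {"middle"}))         # False
print(can_link(three_hop, {"exit"}))           # False
print(can_link(three_hop, {"entry", "exit"}))  # True
```

In this model no single relay sees both ends of a 3 hop circuit; only the entry and exit together do.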
At least I am in the good company of Slashdot and MIT graduates in making this mistake. If you wish to not make mistakes regarding 2 vs 3 hop designs, the research paper On the Optimal Path Length for Tor provides an excellent overview of the pros and cons.
False Assumption #2: there are not enough exit nodes
From scary stories regarding exit node operators being arrested and the huge amount of bandwidth that researchers saw when operating exit nodes, I naturally assumed exit nodes were in short supply. You had to be an elite geek running your own ISP to be able to afford the bandwidth and not get arrested when connections were traced back to your company. Since running a relay node is simple and carries zero risk, it naturally follows that available non-exit relay bandwidth would be an order of magnitude greater than available exit relay bandwidth, especially given the scary quotes regarding exit nodes that are thrown around:
“Without our servers, roughly 25% of all exit traffic in the Tor network passes through one node, which is far from ideal. The currently fastest node, Blutmagie, will be shut down within the next months” –TorServers.net
But, according to the gracious Sebastian on IRC, exit node bandwidth isn’t that scarce and has never been a major problem. Even the 2009 paper on improving Tor’s speed doesn’t mention a desperate need for more exit relays, just more relays. When more was needed, it magically showed up.
Exit enclaves may still be a way to increase Tor speed in the future. Until fairly recently, China tolerated Tor. Once that changed, the Tor project created semi-secret entry nodes, called bridge nodes, to restore connectivity. A cat-and-mouse game has ensued. Tor admins are receiving 8,000 email requests per day asking for new bridge nodes, while there are only ~1,000 public bridge nodes.
Unlike other peer-to-peer networks, Tor does not make every node/peer share their bandwidth. Nodes are not relays by default, but the needed architectural changes to make bridge-node-by-default automatic and scalable are nearly complete. The Tor operators plan on making fast nodes into bridge relays by default. The Tor network varies between 300,000 and 400,000 users, sometimes spiking up to 600,000 users.
There are just over 3,000 relays, a 100:1 ratio of users to relay nodes. And remember, each user’s connection must make additional hops before reaching the exit nodes. If Tor made every suitably fast node a regular relay-by-default, then the entry and intermediate bandwidth capacities would almost certainly be out of balance with the available exit node capacity. If that appears to be happening, I will try my experiment again.