On the Brink of True Distributed Computing

[ The following is a repost from my MySpace blog, which is not accessible unless you have an account there. Also, the audience there isn’t really interested in this stuff :-) ]

The notion that the “network is the computer” – or at least that it could be – has been around for a while. But every actual implementation to date is either too specialized (e.g. SETI@home) or too simplistic (e.g. p2p file-sharing, viruses, DDoS attacks) to be used for generalized computation, or else is bound at some critical bottleneck of centralization. On this latter point, search engines hold promise, but the ones we are familiar with, like Google, rely on both central computational control (for web crawling and result retrieval) and central storage (for indexing and result caching). Lately, social bookmarking/tagging has let those who opt in share the work of crawling, retrieval, and indexing. It remains to be seen whether keyword tags, and clusters of them, are semantically strong enough in practical terms to support general computation. Regardless, whatever heavy lifting the representation level doesn’t support will end up falling on the protocol and computational levels. At the other end of the spectrum, the specialized, computationally intensive projects face the problem of dividing the labor and coordinating results, and no effort to date has yielded a way to generalize distributed computation without a high degree of specialized programming.

If we look at the various computational models that are theoretically strong enough (in the Turing sense) to do generalized work, there is a spectrum created by the tradeoff between a more semantically rich representation and more powerful atomic computation. Using arbitrary web pages as the representational level requires too much processing to get much more than keyword-based search results, and it stretches the limits of our ability to bridge the semantic gap between humans and computers. The Connection Machine model places too much of the representation and programming problem on the shoulders of sophisticated programmers. The Cyc approach seems reasonable in terms of balance, but the problem is still one of centralization, in the form of knowledge engineers and a very semantically rich representational language – one that is hard for the average person to understand. If we are going to break down all the central bottlenecks, the representation must be one that the average human can easily contribute to, at least declaratively and post hoc. The social network model (e.g. Wikipedia) is the logical extreme, wherein humans self-organize as a network computer, not only to create the representation but to do the computation itself. Open-source development, social bookmarking/tagging, and other forms of social networking work the same way. The only trouble with this model is that humans are a very limited resource compared to a network of traditional computing devices – even if you take all five billion of us working together as efficiently as possible.

So what if we combine the best of social networking and Cyc to achieve the “semantic web” outlined by Tim Berners-Lee et al. in the 2001 Scientific American article? Imagine that current web pages (possibly XHTML) can be annotated with just enough semantic structure that the average human can be a useful contributor through activities they are already doing: wiki editing, tagging, browsing, searching, email, SMS, blogging, etc. Then define various meta-operations on top of search that effectively turn any search engine into a social theorem prover. Finally, create a p2p search protocol that is lightweight and extensible enough to harness the computational resources of any network-enabled device via HTTP, SMTP, SMS, etc. The key to the search protocol is to remove all centralization, though certainly the Google API could be made to comply for some added horsepower and an initial boost.
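
To make the protocol idea a little more concrete, here is a minimal sketch of what a self-describing, forwardable search request might look like. This is purely illustrative: the message fields, the “MATCH” operator name, and the addresses are assumptions of mine, not a defined standard – the point is only that a query can carry everything a peer needs to answer part of it and pass the rest along.

```python
import json
import uuid

def make_query(operator, pattern, reply_to, ttl=5):
    """Build a self-describing search request that any peer can forward."""
    return {
        "id": str(uuid.uuid4()),   # lets peers discard duplicates they have already seen
        "operator": operator,      # e.g. "MATCH"; new operators extend the protocol
        "pattern": pattern,        # the keyword or semantic pattern to satisfy
        "ttl": ttl,                # hop limit so queries eventually die out
        "reply_to": reply_to,      # where partial results go (an HTTP URL, mailto:, etc.)
    }

def handle_query(query, local_index, peers, send):
    """Answer what we can from local documents, then gossip the query onward."""
    hits = [doc for doc in local_index if query["pattern"] in doc]
    if hits:
        send(query["reply_to"], json.dumps({"id": query["id"], "hits": hits}))
    if query["ttl"] > 0:
        forwarded = dict(query, ttl=query["ttl"] - 1)
        for peer in peers:
            send(peer, json.dumps(forwarded))

if __name__ == "__main__":
    # Stand-in transport: collect outgoing messages instead of really sending them.
    outbox = []
    send = lambda address, payload: outbox.append((address, payload))
    q = make_query("MATCH", "distributed computing", reply_to="mailto:me@example.org")
    handle_query(q, ["notes on distributed computing", "a recipe blog"],
                 peers=["http://peer-b.example/search"], send=send)
    for address, payload in outbox:
        print(address, payload)
```

The transport here is deliberately abstract: anything that can move a small JSON payload (HTTP, SMTP, SMS) could carry the same message.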

Under the proposed model, the representation is handled as an emergent behavior of the social network, while the heavy computational lifting and storage can be truly and automatically distributed across devices. Coordination and control must be looked at differently than we normally think of them. The network becomes “smarter” and more useful for general computation based on the sophistication of the various meta-operators and the combined, continuous output of crawling, indexing, and search. The way you “control” a computation is by creating a search whose explicit results and/or epiphenomena are what you want. If the necessary search operators don’t exist, you can create new ones and extend the protocol for everyone’s benefit.
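
As a rough illustration of that last point, here is a hypothetical sketch of how a peer might register new meta-operators so the protocol grows by extension rather than by central decree. The operator names (“MATCH”, “CO-TAGGED”), the tagged-index shape, and the registry itself are my own assumptions, offered only to show the flavor of “programming by search.”

```python
OPERATORS = {}

def operator(name):
    """Register a function under a protocol-visible operator name."""
    def register(fn):
        OPERATORS[name] = fn
        return fn
    return register

@operator("MATCH")
def match(pattern, index):
    # Baseline keyword search over this peer's documents (doc text -> set of tags).
    return [doc for doc in index if pattern in doc]

@operator("CO-TAGGED")
def co_tagged(pattern, index):
    # A made-up meta-operator: documents sharing tags with documents that match
    # the pattern -- one search's results used to drive a further search.
    seed_tags = set().union(*(tags for doc, tags in index.items() if pattern in doc))
    return [doc for doc, tags in index.items() if seed_tags & tags]

def run(op_name, pattern, index):
    # "Controlling" a computation amounts to choosing which search to ask the network for.
    return OPERATORS[op_name](pattern, index)

if __name__ == "__main__":
    index = {
        "notes on the semantic web": {"semantics", "web"},
        "tagging and folksonomies": {"tags", "web"},
        "an apple pie recipe": {"cooking"},
    }
    print(run("MATCH", "semantic", index))      # -> ['notes on the semantic web']
    print(run("CO-TAGGED", "semantic", index))  # docs sharing a tag with those matches
```

Any peer that publishes a new operator in this spirit is, in effect, extending what the whole network can compute.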