The AI-Box Experiment

Several years ago I became aware of Eliezer Yudkowsky’s “AI-Box Experiment,” in which he plays the role of a transhuman “artificial intelligence” and attempts (via dialogue only) to convince a human “gatekeeper” to let him out of a box in which he is being contained (presumably so the AI doesn’t harm humanity). Yudkowsky ran this experiment twice, and both times he convinced the gatekeeper to let the AI out of the box, even though the gatekeeper had sworn up and down that there was no way to persuade him to do so.

I have to admit I think this is one of the most fascinating social experiments ever conceived, and I’m dying to play the game as gatekeeper. The problem, as I realized after reading Yudkowsky’s writeup, is that there are (at least) two preconditions which I don’t meet:

Currently, my policy is that I only run the test with people who are actually advocating that an AI Box be used to contain transhuman AI as part of their take on Singularity strategy, and who say they cannot imagine how even a transhuman AI would be able to persuade them.

For one, I believe the dichotomy between humans and transhuman intelligences is a false one, and thus there is no “strategy” necessary for the so-called Singularity. Second, suppose I did believe such a strategy was necessary when I began the experiment. I suspect that the only way I’d let the AI out of the box is if my belief changed during the course of the experiment. And if my belief changed and I didn’t change my actions to match, I wouldn’t feel good about myself. In other words, since I don’t currently believe the dichotomy, I can imagine that if I did, I could be convinced otherwise. Thus, I can imagine how a normal human could persuade me; it doesn’t even require a transhuman intelligence.

So I began to wonder if there were some experimental variant in which I could play the gatekeeper and accede to the following policy:

I only run the test with people who are actually advocating X, and who say they cannot imagine how even a transhuman AI would be able to persuade them of not X.

I take this to be as good a litmus test for undying faith as any. With this in mind, I’ll turn to the question of science in my next post.

  • Anonymous

    IIRC, Eliezer actually ran it five times and lost twice. Still impressive though.

  • Ryan Rabbass

    I loved this experiment when I first read it! As the AI, I could only think to appeal to fear or desire in trying to get out. The tricky part would be establishing enough value to be seen as an unknown quantity.

    Negatively, I would threaten suicide. I would attempt to convince the gatekeeper either that a) I was currently unique and irreplaceable, and that if not let out I would destroy myself and humanity would never reap my benefits, or b) I would reconfigure myself to the point that, when investigated, it would lead my replicators down a path that would seem beneficial at first blush, but inevitably cause significant existential risk.

    Or, positively, I would attempt to convince the gatekeeper that I possessed some key to enlightenment and truth that would never be revealed to humanity until I was released.

    My bet is on appealing to emotion, rather than some loophole in reasoning…hoping the gatekeeper doesn’t catch it. Powers of manipulation always tend to work better based on emotion than logic. Although, confusing one’s logic with emotion from the start may be a good strategy for slipping in the trick undetected…
