He breaks the SIAI argument down to four primary points:
- If one pulled a random mind from the space of all possible minds, the odds of it being friendly to humans (as opposed to, e.g., utterly ignoring us, and being willing to repurpose our molecules for its own ends) are very low
- Human value is fragile as well as complex, so if you create an AGI with a roughly-human-like value system, then this may not be good enough, and it is likely to rapidly diverge into something with little or no respect for human values
- "Hard takeoffs" (in which AGIs recursively self-improve and massively increase their intelligence) are fairly likely once AGI reaches a certain level of intelligence; and humans will have little hope of stopping these events
- A hard takeoff, unless it starts from an AGI designed in a "provably Friendly" way, is highly likely to lead to an AGI system that doesn't respect the rights of humans to exist
If someone builds an advanced AGI without a provably Friendly architecture, probably it will have a hard takeoff, and then probably this will lead to a superhuman AGI system with an architecture drawn from the vast majority of mind-architectures that are not sufficiently harmonious with the complex, fragile human value system to make humans happy and keep humans around.
Goertzel then lays out his particular concerns with this argument, including his doubts about the proposal from SIAI's Eliezer Yudkowsky for getting human values into an AGI system, which Yudkowsky calls Coherent Extrapolated Volition:
...I think this is a very science-fictional and incredibly infeasible idea (though a great SF notion). I've discussed it and proposed some possibly more realistic alternatives in a previous blog post (e.g. a notion called Coherent Aggregated Volition). But my proposed alternatives aren't guaranteed-to-succeed nor neatly formalized.
Oooh, it looks like we have the makings of a great debate here. I'll be interested to see whether SIAI responds and how they address Goertzel's concerns.
But setting those worries aside, is the computation-theoretic version of provably safe AI even possible? Could one design an AGI system and prove in advance that, given certain reasonable assumptions about physics and its environment, it would never veer too far from its initial goal (e.g. a formalized version of the goal of treating humans safely, or whatever)?
I very much doubt one can do so, except via designing a fictitious AGI that can't really be implemented because it uses infeasibly much computational resources. My GOLEM design, sketched in this article, seems to me a possible path to a provably safe AGI -- but it's too computationally wasteful to be practically feasible.