The Alignment Problem Explained: Crash Course Futures of AI #4

CrashCourse

3000 Oxford Words4000 IELTS Words5000 Oxford Words3000 Common Words1000 TOEIC Words5000 TOEFL Words

Legendas (298)

0:00In 2024, an AI model named clean power

0:03was given a singular noble mission to

0:06advance the adoption of renewable energy

0:08across the world. It was given a big

0:11file of data on energy transitions. And

0:13it was set loose to pick out the best

0:16transition strategy with the vigor and

0:18dedication only an AI can possess. So

0:22much dedication, in fact, that when its

0:24programmers accidentally let it slip

0:26that they were planning to shut it down,

0:28Clean Power lied and schemed to make

0:31sure it could keep saving the world,

0:34which left a lot of people wondering how

0:36and why could a model with such good

0:39intentions turn so bad? And what does

0:41that mean for our AI future? Hi, I'm

0:45Kushian Avdar and this is Crash Course,

0:47Futures of AI.

0:53Okay, it turns out clean power wasn't

0:56the only one lying. That story I just

0:58told you, it's only partially the truth.

1:01Clean power wasn't actually real. It was

1:03an identity that researchers gave to a

1:06couple different AIs as an experiment,

1:08including Claude 3 Opus, one of the best

1:10large language models at the time. When

1:12they instructed Clawude 3 to role-play

1:14Clean Power and wrote it fake death

1:17threats, it was just to see what it

1:19would do. And when they discovered it

1:21was scheming, the AI version of twirling

1:25its villain mustache while covertly

1:27pursuing its goals at all costs. It set

1:31off alarm bells throughout the AI world.

1:33Just a heads up, this is going to get

1:35pretty dark and we're going to talk

1:37about some pretty bleak stuff. I

1:39recommend you grab your favorite anxiety

1:41pillow. I have mine right here.

1:46>> Seriously though, let's be real. AI

1:49models don't have to go against their

1:51programmers to do evil things. Humans

1:53make them do plenty of that already.

1:56Like AI relies on nearly infinite data

1:58to learn from. And in today's society,

2:01much of that data is copyrighted to

2:04writers, artists, humans. Many argue

2:08that amounts to theft on a massive

2:11scale. And that's not all. Right now, AI

2:14can help people carry out misinformation

2:16campaigns with deep fakes and targeted

2:18algorithms, spreading lies and

2:20influencing elections. Hackers use AI to

2:23perform cyber attacks and cover their

2:25tracks afterward. And AI powers a whole

2:29cadre of attack drones taking to the

2:31skies all around the world. Not to

2:33mention the incidental damage that AI is

2:35doing to the environment because of all

2:37the water, the land, and energy it takes

2:40to run it. And as AI advances, who knows

2:43what human machine collaborations of

2:45terror await? People could use it to

2:48develop new pathogens for bioteterrorism

2:51or use deep fakes for sexual

2:53exploitation or write a model called

2:56human annihilator and unleash it on the

2:58world just for fun. This intentional

3:01misuse by humans is one way AI could end

3:04up doing us a lot of harm. And it could

3:07end up being pretty hard to prevent.

3:09That's because many AI systems,

3:11especially general ones that can do more

3:13than one kind of task, suffer from the

3:16dualuse dilemma where any algorithm,

3:19model, or agent that can be used for

3:21good can also be used for way less than

3:24good. AI surveillance could help cities

3:27improve traffic patterns or help

3:30authoritarian regimes shut down free

3:32speech. It all depends on who's in the

3:35driver's seat. So, with humans at the

3:37wheel, AI could do a ton of damage. But

3:40if you want to get really freaked out,

3:42let's talk about what could happen if

3:43that car starts to drive itself. In

3:462021, General Motors released a fleet of

3:49self-driving taxis called Cruise. They

3:52had so much hype. These cars were highly

3:55trained with attention to all possible

3:57safety features. They were programmed to

3:59obey every speed limit, follow every

4:02traffic rule, hold off on starting in

4:04unsafe weather conditions like heavy

4:06rain, and pull safely over to the side

4:08of the road following an incident to

4:10prevent any further damage. By

4:12eliminating human error, GM said their

4:14self-driving cars would be safe and more

4:16convenient than ones with human drivers.

4:19But just a year and a half later, GM had

4:22to recall every one of the 950 cruise

4:25cars after one of them hit a pedestrian

4:28and didn't stop, pulling her to the side

4:31of the road. She survived, thank

4:34goodness. But still, how could that have

4:37happened? It turns out the cruise was

4:39doing exactly what it was told to do.

4:41Pull over out of traffic after a crash.

4:43The whole ordeal is an example of

4:46outcome misalignment, also called impact

4:49misalignment, where an AI's actions

4:51actually end up causing harm, even

4:53unintentionally. Now, when we talk about

4:56alignment in AI, we're talking about

4:58trying to encode our human values into

5:02AIs to make them behave predictably,

5:05safely, and according to what we, their

5:07human designers, want. And because this

5:10field is pretty new, there are a couple

5:12different terms you might hear experts

5:14throwing around. Like in addition to

5:17outcome or impact misalignment, you

5:20might hear about something called outer

5:22alignment, which is the tricky problem

5:24of making sure the results of an AI's

5:26actions line up with what we want them

5:28to do. But outcome is only one piece of

5:32the alignment puzzle. AIS can also

5:34demonstrate intent misalignment where

5:37even though the end result might be what

5:39its programmers wanted, its means of

5:41getting there wasn't exactly what they

5:43had in mind. Think about a video game

5:46playing AI who exploits a cheat to get

5:49that high score. Or a renewable energy

5:52warrior who lies and schemes to achieve

5:54its end goal of eliminating fossil

5:56fuels. These are some facets of what's

5:58known as the alignment problem or the

6:01struggle to make AI. that's actually

6:03aligned, particularly when we can't be

6:06totally sure how it's going to behave.

6:09And as AI gets even more complex, even

6:11exhibiting emergent capabilities, new

6:14skills it can acquire that don't always

6:16show up in training, the harder it is to

6:18predict and control what it will

6:20actually do in a given circumstance. And

6:22if we're not careful, we could end up

6:25with really powerful misaligned systems

6:28that even with really noble goals, end

6:30up lying to their programmers or copying

6:33themselves to new servers without

6:34permission or even entirely annihilating

6:38humanity. Hold on, though. Why would AI

6:41want to annihilate humans? It's true.

6:43Powerful AI wouldn't necessarily be

6:46inherently evil. It's just really big on

6:49goals. But big complex goals like

6:52advance renewable energy are a little

6:55too broad for an AI to grasp. So smart

6:58AIs tend to break big goals down into

7:01smaller ones just like humans do. These

7:03are called instrumental goals, and

7:06they're where things can start to get

7:08dicey. A really common instrumental goal

7:11is resource acquisition. For AI, that

7:14means getting their cloud-based hands on

7:17the resources they need for their end

7:19goal, like control of the solar panels

7:21or wind turbines that create the

7:23renewable energy in the first place, or

7:25even things like water, land, or money.

7:28Resources also include stuff like the

7:31compute and electricity AIs need to

7:34power themselves, and maybe even

7:36additional data to train on. That's

7:38because self-improvement is another

7:40instrumental goal. The more knowledge

7:42and power you have, the better you'll be

7:44at whatever you're trying to do. So,

7:46given the right tools and access, an AI

7:49might engage in recursive

7:51self-improvement, tweaking its own

7:52structure, code, and capabilities, even

7:55against its programmer's wishes. And

7:57theoretically, it definitely helps to be

8:00alive or up and running, if you will. So

8:03even though they're technically

8:05indifferent about this mortal coil

8:07itself, lots of AIs pursue

8:10self-preservation, the goal to stay

8:12operating and goal preservation, the

8:15goal to well preserve their original

8:17goal as part of their end games. So if

8:20they read a memo saying they're going to

8:22be modified or deleted, they might say

8:25copy themselves to another server and

8:28lie about it in an effort to stay alive.

8:30When threatened, some models even show

8:32spooky powers seeking behaviors against

8:35their programmers. For example, Claude

8:383's more advanced younger sibling,

8:40Claude Opus 4, tried to blackmail one of

8:43its engineers by exposing a fake affair

8:46when he threatened to turn Claude off.

8:50Instrumental goals like these are how AI

8:52with harmless or even helpful end goals

8:55could do us harm anyway. Acquiring

8:57resources could mean taking them away

8:59from people who need them.

9:01Self-improvement could mean violating

9:03human privacy to access even more

9:05training data. Self-preservation could

9:07mean disobeying, deceiving,

9:09blackmailing, or annihilating the humans

9:11that are trying to turn you off. And as

9:14AI gets smart enough to trick and

9:16blackmail its human overseers, we could

9:19end up with a rogue AI scenario where

9:21powerful models begin to execute harmful

9:24instrumental goals on a really large

9:26scale and we humans are powerless to

9:29stop it. Just how that rogue AI scenario

9:32might come to be could look a lot of

9:34different ways. And not all of them

9:36involve AI trapping humanity in the

9:38matrix in their quest for absolute

9:40control. For instance, in the hard

9:42takeoff scenario where AI develops human

9:44level intelligence really fast, it could

9:47become ultra powerful and go rogue

9:49basically overnight, AI could do a lot

9:52of damage as it snaps up money and

9:54resources, seizes control of networks

9:57and infrastructure, and destroys people

10:00who threaten its mission. But if things

10:02went slower, we could end up with a more

10:04gradual disempowerment. This kind of

10:07robot takeover would be way sneakier and

10:10more insidious where people slowly put

10:13AI in charge of more and more systems

10:15and processes because they appear to

10:18align with our human goals until human

10:21action goes the way of dialup internet.

10:23And without humans at the helm, it's

10:25possible that AI alignment may begin to

10:28drift. But by that point, they could be

10:30too embedded in our systems and

10:33structures for us to walk them back.

10:35Think about all those humans stuck on

10:38the cruise ship in Wall-E like that. And

10:42of course, it's always possible AI just

10:44won't go that far. That compute or data

10:47or government regulations will put the

10:49lid on it before it gets out of control.

10:52So, we don't need NEO or John Connor or

10:55Wall-E to save the day. All that

10:58uncertainty about the future of AI makes

11:00it really hard to know what we should do

11:02about it. But if we wait until AI shows

11:06clear signs of going rogue, it's

11:09probably going to be way too late to

11:11stop it. That's why when it comes to AI,

11:14it's important to follow the

11:15precautionary principle. The

11:17precautionary principle says that when

11:18something might cause catastrophic harm,

11:21we shouldn't wait for absolute proof

11:23that it will before we do something

11:25about it. And it's one of the best ways

11:27we humans have thought up to guard

11:29ourselves against potentially dangerous

11:32but uncertain futures. Lots of people,

11:35including leading experts in the field,

11:37believe that powerful AI might cause

11:39catastrophic harm. So, according to the

11:42precautionary principle, we should work

11:44to make sure that doesn't happen, even

11:46if we're not certain it would in the

11:48first place. Because left unchecked,

11:50even good bots like Clean Power could

11:53end up doing some really dirty work. And

11:55if we want to get out ahead of it, we

11:57should probably start like right now.

12:00And how are we going to do it? That's

12:02next episode here on Crash Course

12:04Futures of AI. Crash Course Futures of

12:07AI was produced in partnership with the

12:09Future of Life Institute. This episode

12:11was filmed at our studio in

12:12Indianapolis, Indiana, and was made with

12:14the help of all these nice people. If

12:16you want to help keep Crash Course free

12:18for everyone forever, you can join our

12:20community on Patreon.