Alignment faking in large language models (anthropic.com)

×1.94 | 302 points by adultorata 6 months ago | 353 comments

Story Stats

This chart shows the history of this story's rank on the Hacker News "Top" (Front) Page, "New" Page, and "Best" Page, as well as its raw rank given the Hacker News ranking formula.

This chart shows the history of this story's upvotes compared to the expected upvotes for stories shown at the same ranks and times.

This chart shows the history of this story's estimated true upvote rate: the predicted long-term ratio of upvotes to expected upvotes.