When Safety Fails Before the Answer: Benchmarking Harmful Behavior Detection in Reasoning Chains