AI Glossary

F1 Score

The F1 score combines precision and recall into a single number: their harmonic mean. It's high only when both are high, which makes it a fairer summary than plain accuracy when the classes are imbalanced.

Also known as: F1


AI Evaluation & Reliability

The F1 score answers a common need: one number that reflects both precision and recall. It's their harmonic mean, 2 × precision × recall / (precision + recall), which, unlike a simple average, stays low if either one is low. So a system can't game F1 by acing precision while missing almost every positive case, or vice versa; it has to be decent at both.
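To see why the harmonic mean matters, here is a minimal Python sketch that computes F1 = 2PR / (P + R) and compares it with a plain average for a system with perfect precision that catches only 1% of the positives. The function name `f1_from_precision_recall` is illustrative, not taken from any library.

```python
def f1_from_precision_recall(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall.

    Returns 0.0 when both are zero, where the formula would be 0/0
    (a common convention, not a mathematical necessity).
    """
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical lopsided system: never wrong when it says "positive",
# but it finds only 1 in 100 of the real positives.
p, r = 1.0, 0.01
arithmetic_mean = (p + r) / 2            # looks respectable
f1 = f1_from_precision_recall(p, r)      # exposes the weak recall
print(f"arithmetic mean = {arithmetic_mean:.3f}, F1 = {f1:.4f}")
# arithmetic mean = 0.505, F1 = 0.0198
```

The plain average rewards the lopsided system with roughly 0.5, while F1 stays near 0.02: the harmonic mean is pulled toward the smaller of the two numbers.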

F1 earns its keep when classes are imbalanced, where plain accuracy misleads. If 99% of cases are negative, a model that always says “negative” scores 99% accuracy and is useless; its F1 on the positive class is zero, which exposes that. The limits: F1 weighs precision and recall equally, which may not match your costs (the weighted F-beta score lets you tilt it, with beta above 1 favoring recall and beta below 1 favoring precision), and it's defined per class, so on multi-class problems how you average matters (macro-averaging gives every class equal weight, while micro-averaging pools the counts so frequent classes dominate). Use it as a balanced summary, but go back to precision and recall when you need to know which side is weak.
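The accuracy trap and the two limits described above can be checked with a small Python sketch built on confusion counts (true positives, false positives, false negatives, true negatives). All counts below are hypothetical, chosen only for illustration, and `f_beta` and `accuracy` are helper names introduced here. In practice, libraries such as scikit-learn provide `f1_score` and `fbeta_score`, whose `average` parameter selects micro, macro, or weighted averaging.

```python
def f_beta(tp: int, fp: int, fn: int, beta: float = 1.0) -> float:
    """F-beta score from confusion counts: (1+b^2)TP / ((1+b^2)TP + b^2 FN + FP).

    beta = 1 gives F1; beta > 1 weights recall more, beta < 1 weights precision more.
    Returns 0.0 when tp, fp and fn are all zero (the score is undefined there).
    """
    b2 = beta * beta
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom else 0.0


def accuracy(tp: int, fp: int, fn: int, tn: int) -> float:
    return (tp + tn) / (tp + fp + fn + tn)


# Hypothetical data: 1,000 cases, only 10 of them positive.
always_negative = dict(tp=0, fp=0, fn=10, tn=990)  # never flags anything
useful_model = dict(tp=8, fp=12, fn=2, tn=978)     # finds 8 of 10, with 12 false alarms

for name, c in [("always negative", always_negative), ("useful model", useful_model)]:
    acc = accuracy(**c)
    f1 = f_beta(c["tp"], c["fp"], c["fn"])
    print(f"{name}: accuracy={acc:.3f}, F1={f1:.3f}")
# always negative: accuracy=0.990, F1=0.000
# useful model: accuracy=0.986, F1=0.533

# Tilting the balance with F-beta (useful model: precision 0.4, recall 0.8).
tp, fp, fn = 8, 12, 2
print(f"F0.5={f_beta(tp, fp, fn, 0.5):.3f}  F2={f_beta(tp, fp, fn, 2.0):.3f}")
# F0.5=0.444  F2=0.667

# Multi-class: per-class (tp, fp, fn) from a hypothetical 3-class confusion matrix.
per_class = {"A": (90, 5, 10), "B": (5, 8, 5), "C": (1, 6, 4)}
per_class_f1 = {k: f_beta(*counts) for k, counts in per_class.items()}
macro_f1 = sum(per_class_f1.values()) / len(per_class_f1)  # every class counts equally
tp_sum, fp_sum, fn_sum = (sum(col) for col in zip(*per_class.values()))
micro_f1 = f_beta(tp_sum, fp_sum, fn_sum)                  # pooled counts: big class dominates
print(f"macro F1={macro_f1:.3f}, micro F1={micro_f1:.3f}")
# macro F1=0.508, micro F1=0.835
```

On these made-up numbers, the always-negative model wins on accuracy (0.990 vs 0.986) yet scores an F1 of zero, while the model that finds 8 of 10 positives scores 0.533. Because its recall exceeds its precision, F2 rates it higher than F1 and F0.5 rates it lower. In the multi-class part, macro-averaging lets the weak minority classes pull the score down to about 0.51, while micro-averaging is dominated by the large class and reports about 0.83.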