Image annotation datasets continue to grow, with tens of millions of
images and tens of thousands of possible annotations. We propose a
high-performing method that scales to such datasets by simultaneously learning to
optimize precision at $k$ of the ranked list of annotations for a given image and
learning a low-dimensional joint embedding space for both images and annotations.
Our method outperforms several baseline methods while being faster and
consuming less memory than them. We also demonstrate that our method learns an
interpretable model, where annotations with alternate spellings, or even in
different languages, are close in the embedding space. Hence, even when our model does not predict the
exact annotation given by a human labeler, it often predicts similar annotations, a
fact that we try to quantify with a newly introduced ``sibling''
precision metric, on which our method also obtains excellent results.