Evaluate Models
Inception Score (IS): Measures the quality and diversity of generated images.
Fréchet Inception Distance (FID): Evaluates the distance between the distributions of real and generated images (see the sketch after this list).
Perceptual Similarity (e.g., LPIPS): Measures how perceptually similar generated images are to reference images, typically via distances between deep-network features.
Mean Opinion Score (MOS): A human evaluation metric for the perceived quality of generated images.
Structural Similarity Index (SSIM)/ Multiscale Structural Similarity (MS-SSIM): Measures the structural similarity between generated images and reference images, focusing on luminance, contrast, and structure.
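For the FID entry above, a minimal sketch of the distance computation, assuming you have already extracted Inception feature statistics (mean and covariance) for the real and generated sets; the function name `fid` and its inputs are illustrative:

```python
import numpy as np
from scipy.linalg import sqrtm

def fid(mu_r, cov_r, mu_g, cov_g):
    """Fréchet distance between Gaussians fitted to Inception features
    of real (r) and generated (g) images; lower is better."""
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):  # numerical noise can add tiny imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```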
IS: Measures the quality and diversity of generated images by evaluating the confidence of a pre-trained Inception model on them. Because it relies only on that pre-trained classifier, it is relatively simple and computationally efficient to implement, and it is widely used in evaluating image generation models.
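As a rough illustration, a minimal sketch of the IS formula, exp(E_x[KL(p(y|x) || p(y))]), assuming you already have the Inception model's softmax outputs for the generated images (the `probs` array is a hypothetical input):

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """probs: (N, C) softmax outputs of a pre-trained Inception model
    on N generated images; higher IS is better."""
    p_y = probs.mean(axis=0)  # marginal class distribution p(y)
    kl = (probs * (np.log(probs + eps) - np.log(p_y + eps))).sum(axis=1)
    return float(np.exp(kl.mean()))  # exp of mean KL(p(y|x) || p(y))
```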
2.9. Image Captioning: Generating descriptive text for images.
BLEU Score: Measures the precision of n-grams in the generated caption compared to a reference caption.
ROUGE Score: Evaluates the overlap of n-grams or subsequences between the generated and reference captions.
METEOR Score: Considers precision, recall, and synonyms for evaluating generated captions.
CIDEr: Measures the consensus in image captioning by comparing the generated caption to multiple reference captions.
BLEU Score: Originally developed for machine translation, BLEU has since become a standard for evaluating various natural language generation tasks, including image captioning. It is the most popular metric for image captioning due to its simplicity, efficiency, and widespread adoption: it is easy to compute and provides a quick measure of how closely generated captions match reference texts by focusing on n-gram precision.
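A minimal sketch of sentence-level BLEU using NLTK; the example tokens are made up, and smoothing is applied because short captions often have zero higher-order n-gram matches:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [["a", "dog", "runs", "across", "the", "park"]]    # reference caption(s)
candidate = ["a", "dog", "is", "running", "in", "the", "park"]  # generated caption

smooth = SmoothingFunction().method1
score = sentence_bleu(references, candidate, smoothing_function=smooth)
print(f"BLEU: {score:.3f}")
```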
3. Text - NLP domain
3.1 Group 1: Based on Ground-Truth Labels
a. Sentiment analysis: Determining the sentiment or emotion expressed in a text.
b. Text Classification: Assigning a category or label to a given text.
c. Named Entity Recognition (NER): Extracting entities from a piece of text and classifying them into predefined categories such as personal names, organizations, locations, and quantities.
d. Part-of-Speech (POS) Tagging: Assigning a grammatical category (e.g., noun, verb, adjective) to each word in a text.
These tasks are evaluated against ground-truth labels, typically with Accuracy or F1-Score (see the sketch below).
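A minimal sketch using scikit-learn; the labels below are made up for a hypothetical 3-class text classification task:

```python
from sklearn.metrics import accuracy_score, f1_score

y_true = [0, 1, 2, 1, 0, 2, 1]  # gold labels
y_pred = [0, 1, 1, 1, 0, 2, 2]  # model predictions

print("Accuracy:", accuracy_score(y_true, y_pred))
print("Macro F1:", f1_score(y_true, y_pred, average="macro"))
```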
3.2 Group 2: Text-to-Text (Seq2Seq) - Image-to-Text
a. Machine Translation: Automatically translating text between languages.
b. Text Generation (Text-to-Text Generation, Image-to-Text Generation)
Output is text
Autocomplete: predicting the next word (or words) as the user types
Chatbots
Captioning for images (detailed in 2.9)
…
c. Text Summarization
d. Text-based Question Answering (QA)
BLEU score: compares n-grams (sequences of words) in a generated text (e.g., machine-translated text, a generated caption) to those in one or more reference texts; see the NLTK example in 2.9 above. For summarization, the recall-oriented ROUGE score is also common (a sketch follows).
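A minimal sketch of ROUGE scoring, assuming the `rouge-score` package (Google's reference implementation) is installed; the sentences are made up:

```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
reference = "the cat sat on the mat"   # human-written target
generated = "a cat was sitting on the mat"

scores = scorer.score(reference, generated)
for name, s in scores.items():
    print(name, f"P={s.precision:.2f} R={s.recall:.2f} F1={s.fmeasure:.2f}")
```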
3.3 Others
a. Language Modeling (Topic modeling)
An unsupervised text mining task that takes a corpus of documents and discovers abstract topics within that corpus
Perplexity: a metric commonly used in natural language processing to evaluate the quality of language models, particularly in the context of text generation. Perplexity quantifies how well a language model can predict the next word in a sequence; it is calculated from the probability distribution over words generated by the model, and lower values are better.
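A minimal sketch: perplexity is the exponential of the average negative log-likelihood the model assigns to the observed tokens (the log-probabilities below are made up):

```python
import math

def perplexity(token_log_probs):
    """token_log_probs: natural-log probabilities a language model
    assigned to each observed next token."""
    nll = -sum(token_log_probs) / len(token_log_probs)  # mean negative log-likelihood
    return math.exp(nll)

print(perplexity([-1.2, -0.4, -2.3, -0.9]))  # e^1.2 ≈ 3.32
```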
b. Visual Question Answering (VQA): Answering open-ended questions based on an image
VQAScore: measures the accuracy and relevance of answers generated by a VQA model in the context of the questions and images provided. It evaluates how closely the model's answers align with human-provided answers.
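As a concrete illustration, a minimal sketch of the widely used VQA-v2 accuracy rule, under which an answer counts as fully correct if at least 3 of the human annotators gave it; the answers below are made up:

```python
def vqa_accuracy(predicted, human_answers):
    """VQA-v2 style accuracy: min(#matching human answers / 3, 1)."""
    matches = sum(answer == predicted for answer in human_answers)
    return min(matches / 3.0, 1.0)

print(vqa_accuracy("dog", ["dog", "dog", "puppy", "dog", "cat"]))  # 1.0
print(vqa_accuracy("cat", ["dog", "dog", "puppy", "dog", "cat"]))  # ≈0.33
```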