Question 1 of 25
Image segmentation differs from classification because:
Answer options for question 1
A. Segmentation only works on grayscale images; color photos need classification
B. Segmentation assigns a class label to every pixel, not one label per image
C. Classification always requires more GPU memory than any segmentation model
D. They are identical tasks that differ only in the name used in research papers
Question 2 of 25
Semantic segmentation means:
Answer options for question 2
A. Only axis-aligned bounding boxes are predicted around each object region
B. Only background pixels receive labels; foreground objects stay unlabeled
C. All pixels of the same class share one label, e.g. every “person” pixel
D. Each object instance gets a unique ID even when the class name is identical
Question 3 of 25
What does instance segmentation add that semantic segmentation cannot do?
Answer options for question 3
A. Run dense prediction without using any convolution layers in the network
B. Remove the need for any pixel-level training labels during supervised learning
C. Predict only one merged foreground blob for all objects of the same class
D. Distinguish person #1 from person #2 with a separate mask per instance
Question 4 of 25
Panoptic segmentation combines:
Answer options for question 4
A. Semantic “stuff” regions (sky, road) plus instance “things” (cars, people)
B. Only bounding boxes and class names without any per-pixel mask output
C. Joint audio spectrogram and video frame prediction in a single head
D. Unsupervised training with no pixel-level annotations of any kind required
Question 5 of 25
DeepLab ASPP (Atrous Spatial Pyramid Pooling) is designed to:
Answer options for question 5
A. Tokenize image patches into subword units for transformer self-attention only
B. Eliminate the need for any pretrained backbone when training on street scenes
C. Capture multi-scale context by pooling features at several dilation rates in parallel
D. Replace the decoder entirely so the network outputs only bounding boxes
Question 6 of 25
Why is dense prediction harder than image classification?
Answer options for question 6
A. Cross-entropy loss is mathematically undefined for per-pixel classification
B. It requires no labeled data because self-supervision replaces all annotations
C. It only works on 1×1 images; larger inputs cannot be segmented at all
D. You must predict H×W labels and keep spatial alignment with the input image
Question 7 of 25
The key idea of U-Net skip connections is:
Answer options for question 7
A. Concatenate high-resolution encoder features into the decoder to recover fine edges
B. Skip training the decoder entirely and predict masks from the encoder only
C. Replace all convolution layers with fully connected layers in the expanding path
D. Connect only the first encoder layer to the last decoder layer and nothing else
Question 8 of 25
In U-Net, the contracting path (encoder) mainly:
Answer options for question 8
A. Predicts final per-pixel class logits directly without any decoder stage
B. Downsamples spatially and increases receptive field to capture broader semantics
C. Removes all channel dimensions so only a single grayscale plane remains
D. Upsamples feature maps stepwise until they match the original image resolution
Question 9 of 25
The expanding path (decoder) in U-Net uses:
Answer options for question 9
A. Cross-attention over a text vocabulary instead of spatial feature tensors
B. Only repeated max pooling to shrink maps further before the output head
C. Recurrent LSTM cells applied independently to each pixel in raster order
D. Upsampling or transposed conv plus conv to grow spatial size toward input resolution
Question 10 of 25
U-Net was originally designed for:
Answer options for question 10
A. Biomedical image segmentation with limited training data, e.g. microscopy cells
B. Large-language-model text generation with autoregressive next-token prediction
C. Spam email filtering using bag-of-words features on plain-text messages
D. Stock price forecasting from historical time-series tabular market data
Question 11 of 25
IoU (Intersection over Union) for a binary mask compares:
Answer options for question 11
A. Training loss averaged across epochs rather than spatial mask overlap
B. Overlap area divided by union area of the predicted mask and ground truth
C. Total model parameter count relative to the number of training images
D. Only the raw count of correctly labeled pixels, ignoring false positives
Question 12 of 25
Pixel accuracy alone can be misleading when:
Answer options for question 12
A. IoU is also reported alongside accuracy in the evaluation summary table
B. Every class occupies roughly equal pixel area across the full training set
C. The background dominates most pixels; severe class imbalance skews the score
D. You use a U-Net architecture with skip connections in the decoder path
Question 13 of 25
Dice coefficient is closely related to:
Answer options for question 13
A. F1 score on binary masks; emphasizes overlap between prediction and truth
B. Learning rate scheduling rules that decay the optimizer step over time
C. Batch normalization statistics computed across the mini-batch dimension
D. Token perplexity used to evaluate language model next-word prediction quality
Question 14 of 25
For multi-class segmentation training, a typical loss is:
Answer options for question 14
A. Contrastive loss that pulls together embeddings of two text sentences
B. Cross-entropy computed independently at each pixel across the H×W output grid
C. Reinforcement learning reward signal with no pixel-level supervision at all
D. Mean squared error on raw RGB pixel intensities instead of class logits
Question 15 of 25
Atrous (dilated) convolution in DeepLab mainly helps by:
Answer options for question 15
A. Working only on single-channel grayscale inputs; RGB images are unsupported
B. Always reducing image width and height at every layer in the decoder stack
C. Being mathematically identical to max pooling with a 2×2 window and stride 2
D. Enlarging receptive field without further shrinking spatial resolution via pooling
Question 16 of 25
Portrait background removal in mobile apps is usually:
Answer options for question 16
A. Linear regression on housing features to predict property price in dollars
B. Autoregressive next-token prediction over a subword text vocabulary
C. Binary or multi-class segmentation that separates person pixels from background
D. K-means color clustering with no neural network and no learned weights
Question 17 of 25
After the final U-Net layer you typically apply:
Answer options for question 17
A. Byte-pair encoding merge rules to tokenize the output tensor as text strings
B. 1×1 conv for C class logits per pixel, then softmax or argmax over classes
C. No output head; raw encoder features are used directly as final predictions
D. Global average pooling over the full map to produce one label for the image
Question 18 of 25
RoIAlign in Mask R-CNN improves on RoI Pooling because it:
Answer options for question 18
A. Requires no backbone features; masks are drawn from random noise tensors only
B. Replaces instance masks with a single semantic map for the entire input image
C. Eliminates the region proposal network so boxes are predicted without proposals
D. Uses bilinear sampling at continuous coordinates, avoiding harsh quantization of region features
Question 19 of 25
If the predicted mask is entirely empty but the ground truth contains a large object, IoU is:
Answer options for question 19
A. 1, because an empty prediction is treated as a perfect match in all evaluation setups
B. 0.5, because IoU defaults to the midpoint whenever the prediction has zero area
C. 0, because there is no intersection between an empty prediction and the truth mask
D. Undefined, because empty predictions are always excluded from IoU computation
Question 20 of 25
Compared to a plain encoder–decoder without skips, U-Net usually:
Answer options for question 20
A. Produces sharper object boundaries by restoring fine spatial detail in the decoder
B. Cannot be trained on GPUs; skip connections require CPU-only execution
C. Outputs only bounding boxes instead of a dense per-pixel class probability map
D. Uses fewer parameters in every configuration because skips remove conv layers
Question 21 of 25
Mask R-CNN extends Faster R-CNN by adding:
Answer options for question 21
A. Global average pooling so the network outputs one label for the whole image
B. A language-model decoder that captions each bounding box with natural text
C. A small mask head that predicts a binary mask inside each detected region of interest
D. Only semantic segmentation, so all instances of a class share one merged mask
Question 22 of 25
Object detection outputs ___; semantic segmentation outputs ___.
Answer options for question 22
A. Sparse boxes plus class labels; a dense per-pixel class map across H×W
B. One caption word per image; audio spectrograms for each detected object
C. Identical output tensors in both tasks; only the loss function name differs
D. Confidence scores only with no spatial location information of any kind
Question 23 of 25
Mean IoU (mIoU) across classes:
Answer options for question 23
A. Cannot be computed when more than two semantic classes exist in the dataset
B. Always ignores the background class and never includes it in the average
C. Is identical to top-1 image classification accuracy on the validation set
D. Averages per-class IoU, treating rare classes more fairly than raw pixel accuracy
Question 24 of 25
A double conv block (conv → ReLU → conv → ReLU) in U-Net:
Answer options for question 24
A. Eliminates skip connections by merging all scales into one tensor automatically
B. Extracts richer local features at each scale before pooling or upsampling
C. Replaces the need for any pixel-level labels during supervised training
D. Is used only in the final 1×1 output layer, not in encoder or decoder blocks
Question 25 of 25
A common reason to use a pretrained DeepLab instead of training a U-Net from scratch on natural photos is:
Answer options for question 25
A. The ImageNet backbone and ASPP already encode edges and multi-scale context, so fine-tuning is faster
B. DeepLab cannot output dense masks; it only supports image-level classification
C. U-Net always requires more GPU memory than any DeepLab variant at equal resolution
D. Pretrained weights are unavailable for any public segmentation architecture