KENSA Defense Quiz — Team Showdown

🎓 Defense Prep Showdown

70 questions covering the entire paper + code. Play head-to-head with your teammates: read each question aloud, give your answer, then reveal the model answer and self-grade. Highest score wins the pre-defense bragging rights.

🎯 Game Mode

One question at a time, self-grading after reveal. Score saves to the leaderboard.

📖 Practice Mode

Browse all 70 questions, reveal answers freely, no scoring. Good for solo prep.

A. Background & Motivation

5 questions

Why is automated steel surface inspection important enough to write a thesis about?

Steel underpins automotive, shipbuilding, construction, aerospace, and machinery. Undetected surface defects (scratches, dents) compromise corrosion resistance and introduce stress concentrations that cause premature failure downstream. Manual inspection is subjective, fatiguing, and can't match modern production-line speed (Wang 2021; Zheng 2021). Defective material reaching customers also damages supplier credibility and triggers costly rework. Automation directly reduces inspection cost, missed defects, and material waste.

Why scratches and dents specifically — why not all defect classes?

Scratches and dents are the two most prevalent surface defects in flat-rolled steel — scratches from contact with foreign objects or interlayer friction during coiling; dents from impact pressure or hard objects pressed into the strip. Limiting to two classes lets us build a focused, well-annotated dataset (500 custom images, 682 bounding boxes) and validate detection rigorously — a broader taxonomy would require an order of magnitude more data and is listed as future work in §4.8.4.

What are the failure modes of manual inspection that motivate automation?

Three: (1) subjective judgment — different inspectors disagree on borderline defects; (2) operator fatigue — performance degrades within minutes of continuous visual work (Zheng 2021); (3) physical limit at production speed — humans cannot match the frame rate needed for high-speed rolling lines. Manual workflows also can't be audited at the per-sheet level the way an immutable scan log can.

Why computer vision + thermal IR instead of ultrasonic or laser-based NDT?

Cost, contact-free operation, and integrability. Ultrasonic NDT requires coupling fluid and contact transducers; laser scanners are accurate but expensive (~10× our budget) and slower at full-sheet coverage. CV with a USB webcam is software-bound, scales easily, and our full prototype is ~₱32k. Thermal IR adds subsurface sensitivity (dents whose thermal signature differs from surrounding material) without a second optical sensor, complementing — not replacing — the visual modality.

What gap in existing literature does your work fill?

Most published steel-surface CV work is single-modality and evaluated only on benchmark datasets like NEU-DET in lab conditions. §2.3 identifies three gaps: (1) limited multi-sensor fusion architectures integrated with industrial workflows; (2) high computational cost of accurate models on legacy factory hardware; (3) dataset bias — NEU-DET-trained models fail on real production lighting, vibration, and dust. Our contribution: a multi-sensor (RGB + thermal) fusion pipeline on consumer hardware with a working web UI and explicit ISO 9001/10012 traceability.

B. Research Questions & Objectives

5 questions

State all 5 research objectives in order.

(1) Acquire a dataset for training and verify against industry standards. (2) Detect scratches and dents via (2.1) deep learning and (2.2) optimal camera placement. (3) Develop a web application integrable into a pre-existing manufacturing system. (4) Develop a prototype demonstrating end-to-end defect detection in a controlled environment. (5) Evaluate against ISO 9001 and ISO 10012, with classification accuracy meeting the minimum threshold for steel sheet manufacturing.

Which ISO standards do you target, and why those two?

ISO 9001:2015 — Quality Management Systems (clauses 5.3 roles, 7.5 documented info, 8.5.2 traceability, 8.6 release, 8.7 nonconformity, 9.1.3 data analysis, 10.2 corrective action). ISO 10012:2003 — Measurement Management Systems (clauses 6.3 equipment, 7.1 metrological confirmation, 7.2 measurement process, 7.3 uncertainty). 9001 governs the workflow / audit trail; 10012 governs the measurement instruments — both required for a defensible inspection system.

RQ 2.2 asks about "optimal camera placement." How did you operationalize "optimal"?

Honestly — partially. We adopted a convergent mounting geometry with both sensors aimed at the inspection zone at comparable working distances to simplify ensemble calibration (§4.7). We didn't produce a quantitative working-distance / angle / FOV spec, which is a fair panelist criticism. Audit item G2: either soften wording to "suitable" or add concrete numbers (cm, degrees, mm²).

Why 5 objectives instead of consolidating?

Each objective maps to a distinct deliverable: (1) dataset, (2) model, (3) software, (4) hardware prototype, (5) compliance. Consolidating would force one deliverable to carry verification for another (e.g., merging 3 and 4 hides whether the web app exists independently of the hardware). The 5-objective structure makes the conclusion auditable — §4.7 has one paragraph per objective.

State the central research question in one sentence.

"Can a multi-sensor fusion system combining a USB webcam (YOLOv11) and an MLX90640 thermal camera detect steel surface scratches and dents with industry-acceptable accuracy on consumer-grade hardware, within an ISO 9001/10012-aligned web-based workflow?"

C. Scope & Delimitation

5 questions

What's the minimum and maximum defect size your system can detect?

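5 mm minimum, 50 mm maximum, established through controlled physical testing (§1.4). Classification thresholds: dent ≥ 3 mm = No Good, scratch ≥ 5 mm = No Good (per Definition of Terms). Note that the 3 mm dent threshold sits below the 5 mm detection floor, so be ready to explain how a dent between 3 mm and 5 mm is handled.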

At what conveyor speed was the system validated?

5 cm/s (§4.4.1, §4.8.2). Audit item G1: missing from Scope & Delimitation (§1.4) — should be added. Higher speeds produce motion blur and reduce thermal dwell time, degrading accuracy.

What defect classes are EXCLUDED, and why?

Rolled-in scale, pitting, laminar cracks, oxide streaks, edge seam defects — all common in industrial steel but absent from training (§4.8.4). Reason: scope discipline. A 500-image, 2-class dataset is rigorously annotated; adding more classes without more data would degrade per-class performance. We explicitly say the system should NOT be deployed to detect those types.

What environmental factors are outside scope?

Heat, dust, smoke, variable ambient lighting, mechanical vibration — acknowledged in §1.4 as influencing sensor data but outside our controlled-lab scope. Also explicitly excluded: thermal sensor calibration, lighting standardization across stations, design of industrial-grade cameras/mounts.

Why is thermal sensor calibration "outside scope"?

The MLX90640 is used relative to a running per-pixel EMA baseline — we look at temperature deviations, not absolute temperatures. Radiometric calibration to a blackbody isn't required for our anomaly rule (Equation 8). What IS required for production deployment is recalibration of the EMA baseline and soft voting weights at the site, which §5.3.2 recommends.

D. Related Literature

5 questions

What did Wen et al. (2024) establish that motivated the shift to deep learning?

A bibliometric trend analysis showing a significant increase in ML-based defect detection methods over 2021–2024, signaling the field's shift from statistical and texture-segmentation approaches toward deep learning. Audit item M1: cited in §1.1 but missing from References — add before defense.

Cite one paper that supports CNNs over rule-based methods for defect detection.

Qiu et al. (2021): CNN-based systems outperform traditional rule-based approaches because they learn complex spatial patterns directly from high-resolution images (§2.2.3). This entry is also missing from References (audit item M5). Additional support: Wang S. (2021) and Zheng et al. (2021) on CNN hierarchical feature representation.

Which fusion-related paper in your RRL most resembles your architecture?

Tsanousa et al. (2022) — review of multisensor data fusion in smart manufacturing. In References but never cited in body (audit item U6). Should be cited in §2.2 / §3.3. Also relevant: Liu, Z. et al. (2024) — multimodal framework with visual/acoustic/vibration signals.

How does Severstal dataset literature inform your work?

Li, Q. et al. (2026) showed pixel-wise annotations on Severstal improve spatial recognition of fine micro-cracks (§2.2.1). We don't use Severstal directly — we use NEU-DET — but the takeaway is that annotation granularity matters. Our Roboflow bounding boxes are coarser, acceptable for our 5–50 mm size range.

Why discuss Industry 4.0 / 5.0 in the synthesis?

To position the work — Industry 4.0 = digital monitoring + automation; Industry 5.0 = human-AI collaboration with explainability. Our system fits Industry 5.0 because the Quality Manager approval workflow keeps a human in the loop. Rosca et al. (2025) frames the same shift from post-production assessment to in-line continuous prediction.

E. Dataset & Annotation

6 questions

How many images in your custom-collected dataset?

500 annotated images, supplemented with a remapped subset of NEU-DET (§4.1, §5.1).

Train/val/test split?

350 train / 100 validation / 50 test (Table 8). Per-class: Train 298 NG + 52 Good; Val 85 NG + 15 Good; Test 42 NG + 8 Good. Total 425 NG + 75 Good = 500.

Total bounding-box annotations?

682 across two classes (Table 9): 487 train + 131 validation + 64 test. Multiple defect annotations per image account for the difference vs. the 500 image count.

What's the Good/No-Good balance in the test set, and why is that a problem?

8 Good vs. 42 No-Good (§4.8.1). Imbalance makes FDR highly sensitive: a single false positive moves FDR from 0% to 12.5% — exactly what happened (LIVE-003 surface-oxidation case). Balanced n ≥ 200 would produce statistically robust FDR/MDR estimates.

Why supplement with NEU-DET, and what's the correct citation?

To mitigate overfitting to our limited custom-collected dataset and improve generalization (§3.6.1). Correct citation: Song, K., & Yan, Y. (2013). A noise robust method based on completed local binary patterns for hot-rolled steel strip surface defects. Applied Surface Science, 285, 858–864. Audit item G3: missing from References.

What augmentations does Roboflow apply to the training set?

Contrast adjustment, horizontal and vertical flipping, rotational variation, and synthetic noise addition (§3.5.2, §3.6.2). Goal: expand effective dataset diversity to improve generalization without collecting more physical samples.

F. Models, Training, Hyperparameters

8 questions

Which 3 YOLO variants did you benchmark, and which was selected?

YOLOv8, YOLOv11, and YOLO26. YOLOv11 selected based on superior per-class F1 (Dent 0.877 / Scratch 0.902) and validation loss 0.754 at epoch 100 (§4.2).

What confidence threshold do you use at inference?

Paper says conf ≥ 0.40 (§3.4.7 Step 4, §3.8.10). CODE MISMATCH: hardware.py Config defaults to confidence: 0.3. Panelists will catch this. Either update the paper to 0.30 or update the code default to 0.40 before defense.
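To see why a panelist would care about the mismatch, here is a minimal sketch of confidence filtering. The detection list and the `filter_detections` helper are hypothetical illustrations, not code from hardware.py; only the two threshold values (0.40 in the paper, 0.30 as the reported code default) come from the answer above.

```python
# Hypothetical detections as (class_name, confidence); values invented for illustration.
detections = [("scratch", 0.82), ("dent", 0.37), ("scratch", 0.31), ("dent", 0.22)]


def filter_detections(dets, conf_threshold):
    """Keep only detections whose confidence meets the threshold."""
    return [d for d in dets if d[1] >= conf_threshold]


paper_kept = filter_detections(detections, 0.40)  # threshold stated in the paper
code_kept = filter_detections(detections, 0.30)   # default reported for hardware.py

print(len(paper_kept), len(code_kept))
```

With these made-up scores the lower threshold lets two extra borderline boxes through, which is the kind of difference that would change reported metrics.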

YOLOv11 mAP@0.5 and mAP@0.5:0.95?

mAP@0.5 = 0.904, mean F1 = 0.889 (§4.1, §5.1). mAP@0.5:0.95 averages AP across the 10 IoU thresholds from 0.50 to 0.95 in steps of 0.05; report the value in Table 10. mAP@0.5 exceeds the 90% literature threshold for prototype systems; mean F1 at 0.889 sits just below 0.90, so do not claim that both clear it.
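If a panelist asks what the IoU thresholds actually do, a small sketch can help. The `iou` helper and the two boxes below are hypothetical examples, not values from the paper; the sketch only shows how one predicted box counts as a match at some thresholds and a miss at stricter ones.

```python
def iou(box_a, box_b):
    """Intersection over union for boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# The 10 IoU thresholds used by mAP@0.5:0.95: 0.50, 0.55, ..., 0.95
thresholds = [round(0.50 + 0.05 * i, 2) for i in range(10)]

predicted = (10, 10, 110, 60)  # hypothetical predicted box
truth = (20, 10, 120, 60)      # hypothetical ground-truth box
overlap = iou(predicted, truth)

# Thresholds at which this prediction would still count as a true positive
matched_at = [t for t in thresholds if overlap >= t]
print(round(overlap, 3), matched_at)
```

AP is computed separately at each threshold and the ten values are averaged, which is why mAP@0.5:0.95 is always the harsher number.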

Per-class F1 scores for YOLOv11?

Dent F1 = 0.877, Scratch F1 = 0.902 (Table 11, §5.1). Scratch slightly higher because linear features have a more consistent visual signature than depressions.

YOLOv11 validation loss and at what epoch?

Validation loss 0.754 at epoch 100 (§4.2, §5.1). Trained for 100 epochs with no divergence between train and validation loss, indicating generalization without overfitting.

Why YOLOv11 over the more mature YOLOv8?

Empirical — YOLOv11 beat YOLOv8 in our benchmark: mAP@0.5 0.904 vs. 0.791, F1 0.889 vs. 0.776 (§5.1). YOLOv11's improved backbone (C2PSA, C3k2 blocks) better handles small linear features like scratches at our image resolution. Maturity matters less than measured per-task performance.

What is YOLO26 and where can a panelist find its documentation?

Critical audit item: YOLO26 is never cited in the paper. Honest answer: it's a non-standard variant — you need to add either an Ultralytics docs URL, the Jocher et al. release notes, or rename to the variant you actually used. A panelist may ask "are you sure that's a real model?" — have a source link ready.

Where did training happen and why that platform?

Google Colab for GPU-accelerated training (§3.5.2), Roboflow for annotation and augmentation, VS Code for local web-app development. Colab removes local GPU dependency, Roboflow standardizes annotation, VS Code supports Python + Git workflows.

G. Fusion Logic & Equations

8 questions

Explain Equation 8 (thermal anomaly detection) in plain English.

A pixel is flagged anomalous only when all three conditions hold: (1) its temperature exceeds the running EMA baseline by more than 2.5σ; (2) it exceeds the baseline by at least 1.0°C in absolute terms; (3) conditions 1 and 2 persist for at least 3 consecutive frames. The statistical and absolute filters together keep sensor noise from being flagged as a defect, and the persistence check rejects transient spikes. CODE MISMATCH: hardware.py uses sigma=3.5, floor=3.0°C, persistent_frames=5, all stricter than the paper.
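The rule can be sketched for a single pixel as follows. This is a minimal illustration, not the hardware.py implementation: the `PixelAnomalyDetector` class, the EMA-variance estimate of σ, seeding the baseline from the first frame, and freezing the baseline while a candidate anomaly is active are all assumptions. The default parameters are the paper values quoted above, and the second run uses the stricter values reported for hardware.py.

```python
from dataclasses import dataclass


@dataclass
class PixelAnomalyDetector:
    alpha: float = 0.02   # EMA smoothing factor (paper value)
    sigma_k: float = 2.5  # statistical threshold in sigmas (paper value)
    floor_c: float = 1.0  # absolute floor in deg C (paper value)
    persist: int = 3      # consecutive frames required (paper value)
    warmup: int = 20      # frames observed before scoring starts (paper value)
    mean: float = 0.0     # running EMA baseline
    var: float = 0.0      # running EMA variance (assumed sigma estimator)
    seen: int = 0
    streak: int = 0

    def update(self, temp_c: float) -> bool:
        """Feed one reading; return True once the pixel is flagged anomalous."""
        if self.seen == 0:
            self.mean = temp_c  # assumption: seed the baseline from the first frame
        self.seen += 1
        delta = temp_c - self.mean
        sigma = self.var ** 0.5
        hot = (
            self.seen > self.warmup
            and delta > self.sigma_k * sigma
            and delta >= self.floor_c
        )
        if hot:
            self.streak += 1
        else:
            self.streak = 0
            # assumption: only update the baseline when the reading looks normal
            self.mean += self.alpha * delta
            self.var = (1 - self.alpha) * (self.var + self.alpha * delta * delta)
        return self.streak >= self.persist


def run(detector):
    """30 quiet frames near 25 deg C, then three frames at 27 deg C."""
    for i in range(30):
        detector.update(25.0 + 0.1 * (i % 2))
    return [detector.update(27.0) for _ in range(3)]


paper_flags = run(PixelAnomalyDetector())
code_flags = run(PixelAnomalyDetector(sigma_k=3.5, floor_c=3.0, persist=5))
print(paper_flags, code_flags)
```

Under these assumptions a roughly 2°C step is flagged on its third frame with the paper parameters but never with the stricter code parameters, which is a concrete way to explain why the mismatch matters.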

What is α in the thermal baseline EMA, and why that value?

α = 0.02. Small alpha = slow baseline drift, so we don't suppress real thermal events by absorbing them into the baseline. Trade-off: longer recovery after a true ambient shift, but lab conditions are stable, so 0.02 is appropriate (§3.4.7 Step 3).

Why a 20-frame warmup before scoring begins?

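The EMA baseline needs to stabilize before deviations are meaningful. With α = 0.02, a 20-frame warmup leaves the baseline close to true ambient only if it is seeded from the first frame; an EMA started from an arbitrary value would still carry 0.98^20 ≈ 67% of its initial offset after warmup. At ~8 Hz frame rate that's ~2.5 seconds of startup, acceptable for a system that runs for hours.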
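A quick simulation shows how much the seeding choice matters during warmup. The `ema_baseline` helper and the constant 25°C input are hypothetical; only α = 0.02 and the 20-frame warmup come from the answer above, and how hardware.py actually seeds its baseline is not stated here.

```python
def ema_baseline(readings, alpha, seed):
    """Run b <- b + alpha * (x - b) over the readings and return the final baseline."""
    baseline = seed
    for x in readings:
        baseline += alpha * (x - baseline)
    return baseline


ALPHA = 0.02
ambient_c = 25.0                 # hypothetical constant ambient temperature
readings = [ambient_c] * 20      # the 20 warmup frames

seeded = ema_baseline(readings, ALPHA, seed=readings[0])  # seeded from first frame
cold = ema_baseline(readings, ALPHA, seed=0.0)            # started from zero

print(seeded, round(cold, 2))
```

The seeded baseline sits at ambient immediately, while the cold-started one has only closed about a third of the gap after 20 frames.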

Why w_webcam = 0.65 and w_thermal = 0.35? How were these chosen?

Empirically: the webcam achieved 90.0% per-sensor accuracy and thermal 80.0% on validation (§3.8.9). The weights bias fusion toward the more reliable modality while keeping a non-trivial 35% from thermal, because it catches subsurface dents the webcam misses. The constraint w_webcam + w_thermal = 1 keeps P_fused interpretable as a probability.

Explain Equation 9 (soft voting ensemble) and its constraint.

P_fused = w_webcam · P_webcam + w_thermal · P_thermal, subject to w_webcam + w_thermal = 1. Weighted average of two per-modality confidence scores. The constraint ensures P_fused stays in [0,1] when both inputs are probabilities. A defect verdict issues when P_fused exceeds the decision threshold τ = 0.50.
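Equation 9 is short enough to sketch directly. The `fuse` function name and the example probabilities are hypothetical; the weights 0.65 / 0.35 and τ = 0.50 are the values quoted above.

```python
def fuse(p_webcam, p_thermal, w_webcam=0.65, w_thermal=0.35, tau=0.50):
    """Weighted soft vote over two per-modality probabilities.

    Returns (p_fused, is_defect). The weights must sum to 1 so that
    p_fused stays in [0, 1] whenever both inputs are probabilities.
    """
    if abs(w_webcam + w_thermal - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    p_fused = w_webcam * p_webcam + w_thermal * p_thermal
    return p_fused, p_fused > tau


# Hypothetical case: weak visual evidence, strong thermal evidence
p1, defect1 = fuse(0.30, 0.90)
# Hypothetical case: moderate visual evidence, almost no thermal evidence
p2, defect2 = fuse(0.60, 0.10)

print(round(p1, 3), defect1, round(p2, 3), defect2)
```

The first case shows thermal evidence tipping a weak webcam score over τ; the second shows a moderate webcam score failing to reach τ without thermal support.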

Walk through Equation 10 for: 2 detections, no thermal anomaly, 3 hotspots, 1 fusion hit. Final severity?

sev = min(2·18, 50) + 25·0 + min(3·6, 20) + min(1·8, 15)
= 36 + 0 + 18 + 8 = 62
sev_final = min(62, 100) = 62 → POSSIBLE DEFECT bin (40 ≤ 62 < 70).
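The same walkthrough as a sketch in code. The function names are hypothetical; the per-term weights and caps are read off the worked example above, and the bin labels and the 15 / 40 / 70 cut-offs follow the hardware.py description given later in this section. Treating each lower bound as inclusive is an assumption, consistent with the 40 ≤ 62 < 70 line above.

```python
def severity(detections, thermal_anomaly, hotspots, fusion_hits):
    """Equation 10 as used in the walkthrough: capped terms, then a global cap of 100."""
    score = (
        min(detections * 18, 50)       # YOLO detections, capped at 50
        + 25 * int(thermal_anomaly)    # flat bonus when a thermal anomaly is present
        + min(hotspots * 6, 20)        # thermal hotspots, capped at 20
        + min(fusion_hits * 8, 15)     # fusion hits, capped at 15
    )
    return min(score, 100)


def verdict(score):
    """Map a severity score to a bin (labels as reported for hardware.py)."""
    if score >= 70:
        return "DEFECT CONFIRMED"
    if score >= 40:
        return "POSSIBLE DEFECT"
    if score >= 15:
        return "INSPECT"
    return "PASS"


score = severity(detections=2, thermal_anomaly=False, hotspots=3, fusion_hits=1)
print(score, verdict(score))  # 62 POSSIBLE DEFECT
```

Note that the uncapped maximum is 50 + 25 + 20 + 15 = 110, so the final min(…, 100) only matters when every term is saturated.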

What's the decision threshold τ?

τ = 0.50 for binary Good/No-Good on the fused probability P_fused (§3.4.7 Step 5, §3.8.9). Note: hardware.py implements verdict via severity bins, not directly via τ on a probability — be ready to explain that the paper presents the conceptual decision rule and the code implements an equivalent severity-bin scheme.

What are the 4 verdict bins your system produces?

Paper (Table 7): PASS, POSSIBLE DEFECT, THERMAL ANOMALY, DEFECT CONFIRMED. CODE MISMATCH: hardware.py uses PASS, INSPECT, POSSIBLE DEFECT, DEFECT CONFIRMED with thresholds 15 / 40 / 70. "INSPECT" appears instead of "THERMAL ANOMALY". Align both before defense.

H. Hardware & System Architecture

6 questions

What are the two sensor modalities and the role of each?

(1) USB webcam — primary visual detection of scratches and dents via YOLOv11 bounding boxes under LED illumination. (2) MLX90640 thermal infrared camera — captures spatial temperature distribution, detects anomalies that may indicate subsurface defects whose thermal signature differs from surrounding material.

Thermal sensor model and resolution?

Adafruit MLX90640 (Melexis chip), resolution 32×24 pixels = 768 thermal pixels per frame. The EMA is maintained per pixel for all 768.

What microcontroller bridges the thermal sensor to the host?

ESP32-S3 Development Board (Table 4). Reads the MLX90640 via I²C and streams the binary 32×24 grid to the Mini-PC over USB serial.
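For intuition on the host side, decoding one frame might look like the sketch below. The wire format is an assumption: the answer above only says the grid is streamed in binary, so the layout used here (768 little-endian 4-byte floats, row-major, no header or checksum) and the `decode_frame` helper are hypothetical.

```python
import struct

ROWS, COLS = 24, 32               # MLX90640 grid: 32 columns x 24 rows
FRAME_BYTES = ROWS * COLS * 4     # assumption: one 4-byte float per pixel


def decode_frame(payload: bytes):
    """Turn one raw frame payload into a 24 x 32 list of temperatures."""
    if len(payload) != FRAME_BYTES:
        raise ValueError(f"expected {FRAME_BYTES} bytes, got {len(payload)}")
    flat = struct.unpack(f"<{ROWS * COLS}f", payload)  # assumption: little-endian
    return [list(flat[r * COLS:(r + 1) * COLS]) for r in range(ROWS)]


# Round-trip check with a synthetic frame (pixel i holds the value i)
synthetic = struct.pack(f"<{ROWS * COLS}f", *[float(i) for i in range(ROWS * COLS)])
grid = decode_frame(synthetic)
print(len(grid), len(grid[0]), grid[1][0])
```

A real link would also need frame synchronisation (for example a start marker) so the reader can recover after a dropped byte; that detail is left out here.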

What baud rate is the thermal serial link?

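921,600 baud (§3.4.7 Step 2; hardware.py baud_rate: 921600). The high rate provides headroom: each thermal frame is 768 float values, so assuming 4-byte floats and 8N1 framing, an 8 Hz stream needs roughly 246 kbaud. A common default such as 115,200 baud would bottleneck below the 8 Hz sensor frame rate.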
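The bandwidth argument can be checked with back-of-the-envelope arithmetic. Two assumptions are not stated in the source and are labeled in the code: each value is sent as a 4-byte float, and the link uses 8N1 framing (10 bits on the wire per byte).

```python
PIXELS = 32 * 24               # 768 values per thermal frame
BYTES_PER_VALUE = 4            # assumption: 4-byte floats
BITS_PER_BYTE_ON_WIRE = 10     # assumption: 8N1 framing (start + 8 data + stop)
FRAME_RATE_HZ = 8              # sensor frame rate quoted above

bits_per_frame = PIXELS * BYTES_PER_VALUE * BITS_PER_BYTE_ON_WIRE
required_baud = bits_per_frame * FRAME_RATE_HZ


def max_fps(baud):
    """Upper bound on frames per second a given baud rate can carry."""
    return baud / bits_per_frame


print(required_baud, max_fps(921_600), max_fps(115_200))
```

Under these assumptions 921,600 baud leaves comfortable margin over the 8 Hz requirement, while 115,200 baud cannot keep up.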

Role of the Mini-PC vs. the ESP32-S3?

Mini-PC: runs hardware.py — YOLOv11 inference, fusion logic, severity scoring, matplotlib dashboard, CSV logging, HTTPS upload to Supabase. ESP32-S3: dedicated to thermal acquisition — reads MLX90640 via I²C, streams frames serially. Separating concerns prevents thermal acquisition jitter when the Mini-PC is busy with YOLO inference.

Why a Mini-PC and not a Raspberry Pi or full desktop?

Four reasons: (1) the Mini-PC has the GPU/CPU headroom for YOLOv11 inference at 9.3 ms/image, where a Pi would struggle; (2) its small footprint suits production-line mounting; (3) it runs Windows with the standard Ultralytics/OpenCV/PySerial stack; (4) it costs ₱10,000 versus ₱20k+ for a workstation (Table 4).

I. Software Stack & Web App (Steel IRIS)

6 questions

Deployed URL?

kirabase.net (§4.7 Result 3). Be ready to demo if asked.

Three-tier architecture — what's each tier?

(1) Frontend: Vite-built React app on Vercel. (2) Backend: Supabase PostgreSQL (auth + row-level security + REST + Realtime). (3) Hardware acquisition layer: Python (hardware.py) on the Mini-PC, communicating with backend over HTTPS.

How many functional modules in Steel IRIS?

Nine: authentication, live inspection dashboard, manual review workflow, analytics dashboard, kanban workflow board, steel sheet records, scan log records, inspection records, equipment registry (§4.5, §4.7 Result 3).

The 5 RBAC roles, and the most restricted?

(1) Operator — most restricted, can only create scan logs + view own history. (2) Inspector — performs inspections, accesses reports, annotates. (3) Quality Manager — approves/rejects/reworks. (4) Admin — equipment registry, user management. (5) Super Admin — full system + cross-org access (Table 16).

How is row-level security enforced?

Supabase RLS policies on PostgreSQL tables enforce role-based access at the database layer, independently of the frontend. Even if a malicious actor bypasses the React UI and hits the REST API directly, RLS still rejects unauthorized queries (§4.5.6). Critical for ISO 9001 §5.3 — authority can't be circumvented client-side.

Why React + Vite + Supabase + Vercel — what are the trade-offs?

React: mature ecosystem, shadcn/ui components. Vite: fast dev loop, smaller bundles than Webpack. Supabase: managed Postgres + auth + RLS + realtime — replaces a full backend service. Vercel: zero-config deploy. Trade-off: Supabase free tier has bandwidth limits; production line would need self-hosted Postgres or Pro plan.

J. Results & Metrics

6 questions

Overall classification accuracy of the fused pipeline?

96.0% on the 50-image held-out test set (Table 15, §4.6.3, §5.1). Exceeds the 90% literature threshold by 6 points.
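It helps to be able to rebuild the headline numbers from raw counts. The confusion counts below are inferred from the figures quoted in this section (42 No-Good and 8 Good test sheets, 1 missed defect, 1 false positive), not copied from Table 15, so treat them as an assumption. FDR is computed the way this guide uses it: false positives divided by the number of Good sheets.

```python
# Inferred confusion counts for the 50-image test set (assumption, see lead-in)
tp, fn = 41, 1   # No-Good sheets: correctly rejected / missed
tn, fp = 7, 1    # Good sheets: correctly passed / falsely rejected

total = tp + fn + tn + fp
accuracy = (tp + tn) / total
mdr = fn / (tp + fn)   # missed detection rate over No-Good sheets
fdr = fp / (tn + fp)   # false detection rate over Good sheets, as defined in this guide

print(f"accuracy={accuracy:.1%} MDR={mdr:.1%} FDR={fdr:.1%}")
```

Being able to say "48 of 50 correct, one miss, one false alarm" is usually more convincing to a panel than quoting percentages alone.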

MDR and why it's safety-critical?

MDR = 2.4% (1 missed defect out of 42 No-Good test sheets). MDR is the rate at which actual defective sheets pass as Good. It is safety-critical because missed defects propagate downstream into automotive panels and structural members that can then fail in service. Our 2.4% is well below the 10% ceiling.

FDR and how does it compare to the 10% threshold?

FDR = 12.5% — marginally exceeds the 10% acceptance criterion (§4.6.3, §5.1). 1 false positive out of 8 Good sheets = 12.5%. Tiny denominator makes the metric brittle. Future work proposes a fusion veto rule (reject No-Good when thermal Δ < 1.0°C and YOLOv11 conf < 0.55).
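The proposed veto rule can be sketched as a post-processing step. This is future work in the paper, so nothing below exists in hardware.py: the `apply_veto` helper is hypothetical, and downgrading the verdict to "Good" (rather than, say, routing it to manual review) is an assumption. The two cut-offs, 1.0°C and 0.55, are the ones quoted above.

```python
def apply_veto(verdict, thermal_delta_c, yolo_conf,
               delta_floor_c=1.0, conf_floor=0.55):
    """Overturn a No-Good verdict when BOTH sensors are only weakly convinced."""
    weak_thermal = thermal_delta_c < delta_floor_c
    weak_visual = yolo_conf < conf_floor
    if verdict == "No Good" and weak_thermal and weak_visual:
        return "Good"  # assumption: the veto downgrades to Good
    return verdict


# Hypothetical borderline readings (not the actual LIVE-003 values)
print(apply_veto("No Good", thermal_delta_c=0.6, yolo_conf=0.48))  # vetoed
print(apply_veto("No Good", thermal_delta_c=1.4, yolo_conf=0.48))  # thermal is strong, kept
```

Because the veto only fires when both modalities are weak, a confident detection from either sensor still stands, which limits the risk of the rule raising MDR.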

What caused the FDR threshold miss?

A single sheet, LIVE-003, with a surface-oxidation patch that produced borderline readings on both sensors and triggered a false No-Good verdict. The cause is identified, a mitigation is proposed, and the paper states openly that the FDR ceiling is not met at this prototype scale (§4.4, §4.6.3).

Average end-to-end pipeline latency per inspection cycle?

106 ms per cycle (≈9.4 fps), well within the 200 ms per-sheet budget (§4.7 Result 4, §5.1). Conclusion 4 also reports 111 ms processing time per image — be ready to reconcile both numbers (106 = full pipeline, 111 = pipeline + display rendering).

YOLOv11 inference speed in ms/image and equivalent FPS?

9.3 ms per image (Conclusion 2, §5.2) — ~107 fps inference-only. Pipeline-end fps is lower (9.4 fps) because it includes thermal acquisition, fusion, severity scoring, dashboard rendering, and HTTPS upload.
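The latency and frame-rate figures are two views of the same numbers, and the conversion is worth having at your fingertips. The `to_fps` helper is just illustrative; the inputs (9.3 ms, 106 ms, 200 ms budget) are the values quoted in this section.

```python
def to_fps(latency_ms):
    """Convert a per-item latency in milliseconds to items per second."""
    return 1000.0 / latency_ms


inference_fps = to_fps(9.3)      # YOLOv11 inference only
pipeline_fps = to_fps(106.0)     # full inspection cycle
budget_used = 106.0 / 200.0      # share of the 200 ms per-sheet budget

print(f"{inference_fps:.1f} fps, {pipeline_fps:.1f} fps, {budget_used:.0%} of budget")
```

Inference accounts for under a tenth of the 106 ms cycle, so the remaining stages dominate end-to-end latency.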

K. ISO Compliance

4 questions

How many ISO clauses audited, and what's the breakdown?

12 clauses audited (Table 18, §4.6.4). 8 Compliant, 3 Partially Compliant, 1 Not Compliant. Compliant cluster is in operational record-keeping, traceability, role authority, workflow control. Partial findings have technical capability in place but lack formal artifacts (calibration certificates, scheduled recalibration, root-cause-analysis records).

Which single clause shows "Not Compliant" and why?

ISO 10012 §7.3 — Measurement Uncertainty. No formal uncertainty budget computed for defect-size measurements. Closing the gap requires R&R studies for each sensor modality plus per-modality uncertainty propagation analysis — listed in §5.3.1 as near-term improvement.

How does Steel IRIS satisfy ISO 9001 §8.7 (control of nonconforming outputs)?

The Kanban workflow board (Figure 12) has explicit Rejected and Rework columns. A No Good or Rework verdict must be reviewed and confirmed by a Quality Manager before the sheet advances — no automated bypass. Nonconforming material can't be released as conforming without a human decision recorded in an immutable audit trail (§4.6.1, Table 18).

What does the Equipment Registry satisfy in ISO 10012?

§7.1 (Metrological confirmation). The Equipment Registry (Figure 17) stores each instrument's identifier, model, location, and operational status, providing a centralized auditable record. It also partially supports §6.3 (Material resources for measurement equipment).

L. Limitations & Edge Cases

6 questions

Why does the webcam underperform on corner-located defects?

Three factors (§4.8.3): (1) optical FOV introduces barrel distortion at the image periphery, deforming linear features like scratches; (2) overhead LED panel produces uneven illumination — corners receive less direct light; (3) training corpus had higher density of centrally-positioned defect annotations → annotation bias toward central detection. Mitigation: lens distortion correction, directional side-lighting, corner-defect augmentation (§5.3.1).

What happens to accuracy as conveyor speed exceeds 5 cm/s?

Two failure modes (§4.4.1): (1) motion blur reduces YOLOv11 confidence on scratches because linear features smear in the direction of motion; (2) reduced thermal dwell time — the MLX90640 needs multiple frames over the same region to pass the 3-frame persistence rule. Mitigation now: manual conveyor-stop button. Future: automatic speed adaptation when fused probability falls in POSSIBLE DEFECT range.

Why is n=50 test set a real limitation, and what would be statistically robust?

FDR is computed over only 8 Good sheets, so a single false positive moves the metric by 12.5 percentage points (§4.8.1). FDR/MDR not stable at this scale. Recommended: n ≥ 200 with balanced Good/No-Good (~50/50) — brings FDR sensitivity to per-FP increments of ~1%, statistically robust for production claims.
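The granularity argument is simple arithmetic. In the sketch below, `fdr_step_pct` is a hypothetical helper, and the balanced n = 200 case assumes exactly 100 Good sheets.

```python
def fdr_step_pct(n_good):
    """Percentage points that a single false positive adds to FDR."""
    return 100.0 / n_good


current_step = fdr_step_pct(8)       # today's test set: 8 Good sheets
proposed_step = fdr_step_pct(100)    # balanced n = 200, assuming 100 Good sheets

# Largest number of false positives that still keeps FDR at or below 10%
max_fp_current = int(0.10 * 8)
max_fp_proposed = int(0.10 * 100)

print(current_step, proposed_step, max_fp_current, max_fp_proposed)
```

With only 8 Good sheets, the 10% criterion can be met only with zero false positives; with 100 Good sheets, up to 10 are tolerated, so the metric stops being all-or-nothing.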

What real-factory variables were not simulated in your testbed?

§4.8.2: variable ambient lighting, temperature fluctuations affecting thermal baseline, mechanical vibration shifting sensor alignment, dust or oil contamination obscuring/mimicking defects, production speeds above 5 cm/s. §5.3.2 recommends site-specific re-evaluation with recalibration of thermal EMA baseline and soft voting weights before any production deployment.

Why might the thermal modality outperform vision on subsurface dents?

Subsurface dents may not produce a strong visible boundary or shadow under overhead LED illumination, so YOLOv11 doesn't see them. But they change local heat-conduction profile — under steady thermal load, the dented area equilibrates to a slightly different temperature than surrounding material. The MLX90640 sees that delta and Equation 8 flags it. This is exactly the case the soft voting ensemble was designed to catch.

"Why can't your system detect other defects like pitting or oxide scale?" — what's your answer?

Honest answer: because we didn't train it on those classes. The KENSA dataset has zero annotated examples of pitting, rolled-in scale, laminar cracks, oxide streaks, or edge seam defects (§4.8.4). The model would either miss them or misclassify them as scratches/dents. Explicit scope limitation, not a design failure — the architecture (YOLOv11 + thermal fusion) generalizes to additional classes; the dataset just needs expansion + retraining.
