nerdexam


PROFESSIONAL-DATA-ENGINEER Question #255: Real Exam Question with Answer & Explanation


Submitted by ravi_2018 · Mar 30, 2026 · Designing data processing systems

Question

You are building a real-time prediction engine that streams files, which may contain PII (personally identifiable information) data, into Cloud Storage and eventually into BigQuery. You want to ensure that the sensitive data is masked but still maintains referential integrity, because names and emails are often used as join keys. How should you use the Cloud Data Loss Prevention API (DLP API) to ensure that the PII data is not accessible by unauthorized individuals?

Options

  • A. Create a pseudonym by replacing the PII data with cryptogenic tokens, and store the non-tokenized data in a locked-down bucket.
  • B. Redact all PII data, and store a version of the unredacted data in a locked-down bucket.
  • C. Scan every table in BigQuery, and mask the data it finds that has PII.
  • D. Create a pseudonym by replacing PII data with a cryptographic format-preserving token.

Explanation

D is correct because cryptographic format-preserving tokenization (using DLP's CryptoReplaceFfxFpeConfig) is deterministic - the same PII input always produces the same token, so JOIN keys remain consistent across tables. The original data is never stored in plaintext anywhere in the pipeline, satisfying both the masking and referential integrity requirements simultaneously.
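
The determinism property can be illustrated without calling the DLP API. The sketch below is a stand-in, not DLP's FFX format-preserving encryption: it uses a keyed HMAC-SHA256 from Python's standard library, which is deterministic but neither reversible nor format-preserving. The key, the `pseudonymize` helper, and the sample rows are all hypothetical. It only shows why a keyed, deterministic transformation keeps join keys aligned across tables.

```python
import hashlib
import hmac

# Hypothetical secret; in a real pipeline the key would come from a key
# management service rather than source code.
SECRET_KEY = b"demo-key-not-for-production"


def pseudonymize(value: str, key: bytes = SECRET_KEY) -> str:
    """Return a deterministic keyed token for value (HMAC-SHA256, truncated)."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]


# Two hypothetical tables that share email as a join key.
users = [{"email": "ana@example.com", "plan": "pro"},
         {"email": "bo@example.com", "plan": "free"}]
orders = [{"email": "ana@example.com", "total": 30},
          {"email": "ana@example.com", "total": 12},
          {"email": "bo@example.com", "total": 7}]

# Tokenize each table independently, as separate ingestion jobs would.
masked_users = [{"email": pseudonymize(u["email"]), "plan": u["plan"]} for u in users]
masked_orders = [{"email": pseudonymize(o["email"]), "total": o["total"]} for o in orders]

# Join on the token: the same input yields the same token, so rows still line up.
plan_by_token = {u["email"]: u["plan"] for u in masked_users}
joined = [(plan_by_token[o["email"]], o["total"]) for o in masked_orders]
print(joined)  # [('pro', 30), ('pro', 12), ('free', 7)]
```

Neither masked table contains a raw email address, yet the join produces the same pairs it would have produced on the original values.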

Why the others fail:

  • A ("cryptogenic tokens") stores the original unmasked data in a separate bucket, creating a second attack surface - the locked-down bucket can still be breached, so PII remains accessible.
  • B (redaction + unredacted bucket) destroys the data's usefulness as a join key - [REDACTED] cannot be joined - and again stores raw PII in a bucket.
  • C (scan BigQuery tables) is reactive and operates after data has already landed unmasked in BigQuery; it also doesn't address the streaming ingestion phase or guarantee join-key consistency.
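
The join-key problem with redaction in option B can be seen in a few lines. This is a minimal sketch with hypothetical rows and hypothetical helper names (`inner_join`, `redact`); it does not use the DLP API and only models redaction as replacing the key column with a constant placeholder.

```python
REDACTED = "[REDACTED]"

# Hypothetical tables sharing email as the join key.
customers = [{"email": "ana@example.com", "name": "Ana"},
             {"email": "bo@example.com", "name": "Bo"}]
orders = [{"email": "ana@example.com", "total": 30},
          {"email": "bo@example.com", "total": 7}]


def inner_join(left, right, key):
    """Naive nested-loop inner join on a shared key column."""
    return [(a, b) for a in left for b in right if a[key] == b[key]]


def redact(rows, key):
    """Replace the key column with a constant placeholder."""
    return [{**row, key: REDACTED} for row in rows]


true_matches = inner_join(customers, orders, "email")
redacted_matches = inner_join(redact(customers, "email"),
                              redact(orders, "email"), "email")

print(len(true_matches))      # 2: each order pairs with its own customer
print(len(redacted_matches))  # 4: every row matches every row
```

Once every key collapses to the same placeholder, the join degenerates into a cross product, so the relationship between customers and orders is lost.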

Memory tip: Associate referential integrity with deterministic tokenization - if you need the same value to produce the same masked result every time (for JOINs), you need a cryptographic token, not random masking or redaction. The word "pseudonym" in the answers is your cue: a good pseudonym is consistent, but only D applies it cryptographically without storing the original.

Topics

#Cloud DLP · #PII Masking · #Data De-identification · #Referential Integrity
