Token-Level Watermarking and Its Susceptibility to Summarization Attacks

Hypothesis

Kirchenbauer et al. (2023a) propose a watermark that is injected at the token level. At each generation step, the vocabulary \(V\) is pseudorandomly partitioned into a “green list” and a “red list”, where the random seed for the partition is computed by hashing the previously generated token. A globally fixed bias parameter \(\delta > 0\) is added to the logit of each green-list token, so the LM is induced to generate more green-list tokens. The watermark is detected with a one-proportion z-test on the number of green-list tokens in the generated text.
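For intuition, here is a minimal sketch of the green-list biasing step. The seeding function and helper names are toy choices of mine, not the authors' exact implementation.

```python
# Minimal sketch of token-level watermarking: partition the vocabulary with a seed
# derived from the previous token, then bias green-list logits by delta.
# The hashing/seeding below is a toy choice, not the official scheme.
import torch

def green_list_mask(prev_token_id: int, vocab_size: int, gamma: float) -> torch.Tensor:
    """Pseudorandomly mark a gamma-fraction of the vocabulary as 'green',
    seeded by a hash of the previously generated token."""
    gen = torch.Generator().manual_seed(hash(prev_token_id) % (2**31))
    perm = torch.randperm(vocab_size, generator=gen)
    mask = torch.zeros(vocab_size, dtype=torch.bool)
    mask[perm[: int(gamma * vocab_size)]] = True
    return mask

def bias_logits(logits: torch.Tensor, prev_token_id: int, gamma: float, delta: float) -> torch.Tensor:
    """Add delta to the logit of every green-list token before sampling."""
    mask = green_list_mask(prev_token_id, logits.shape[-1], gamma)
    return logits + delta * mask.float()
```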

Because of the token-level design of the watermark algorithm, perturbing a token \(w_t\) in a generated sequence \(w_{1:T}\) through paraphrasing changes the green list for the next token \(w_{t+1}\). As a result, a green token \(w_{t+1}\) may be counted as red once the green list has changed, which undermines the detectability of the watermark (Krishna et al., 2023). Moreover, because the watermark modifies logits directly, it can degrade the quality of the generated text (Fu et al., 2023).

To address the above problem, our lab recently published a paper (Hou et al., 2023) that attempts to counter this paraphrase attack with sentence-level semantic watermarking: candidate sentences are first mapped into embeddings, the semantic space is partitioned through locality-sensitive hashing (LSH), and the LM is rejection-sampled until it produces a sentence whose embedding falls in a valid region.
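A rough sketch of that rejection-sampling loop is below; `embed` and `sample_sentence` are hypothetical placeholders for a sentence encoder and an LM call, and this is not the exact procedure of Hou et al. (2023).

```python
# Rough sketch of sentence-level semantic watermarking via LSH and rejection sampling.
# `sample_sentence` and `embed` are hypothetical placeholders, not Hou et al.'s code.
import numpy as np

def lsh_signature(vec: np.ndarray, hyperplanes: np.ndarray) -> int:
    """Locality-sensitive hash: one bit per random hyperplane, giving a region index."""
    bits = (hyperplanes @ vec) > 0
    return int("".join("1" if b else "0" for b in bits), 2)

def generate_watermarked_sentence(prompt, sample_sentence, embed,
                                  hyperplanes, valid_regions, max_tries=50):
    """Keep rejection sampling from the LM until a candidate sentence's embedding
    falls into a 'valid' LSH region."""
    sent = None
    for _ in range(max_tries):
        sent = sample_sentence(prompt)  # one candidate sentence from the LM
        if lsh_signature(embed(sent), hyperplanes) in valid_regions:
            return sent
    return sent  # fall back to the last candidate if no valid one was found
```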

However, I suspect that a summarization attack could be more “dangerous” than a paraphrase attack, raising new challenges for both token-level and semantic-level watermarking. Here are two hypotheses:

  1. Token-level: Summarization outputs have lower average spike entropy than regular generation or paraphrase outputs, due to their brevity and relatively “fixed” syntax. This potentially leads to more red-list tokens in the generated sequence, making the watermark harder to detect at the token level.
  2. Semantic-level: The summarization task might generate many semantically close candidate sentences, which could unintentionally bias the partitioning of the semantic space. Consequently, any summary of the same prompt, even a human-written one, might be mistakenly classified as model-generated. This effect would be analogous to the impact of a large \(\gamma \approx 1\) on token-level watermarking.

Because of time constraints, I only tested the first hypothesis: the spike-entropy comparison between summarization and generation, and its effect on token-level watermark detection.
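For reference, spike entropy is defined in Kirchenbauer et al. (2023a) as \(S(p, z) = \sum_k \frac{p_k}{1 + z\,p_k}\) for a next-token distribution \(p\) and scalar modulus \(z\). A small sketch of how it can be computed from model logits (the modulus value is a placeholder; the actual experiments follow the original implementation):

```python
# Spike entropy, S(p, z) = sum_k p_k / (1 + z * p_k), averaged over generation steps.
# The modulus passed in is a placeholder, not necessarily the value used in the paper.
import torch

def avg_spike_entropy(logits: torch.Tensor, modulus: float) -> float:
    """logits: (seq_len, vocab_size) tensor of next-token logits for one sequence."""
    probs = torch.softmax(logits, dim=-1)
    per_step = (probs / (1.0 + modulus * probs)).sum(dim=-1)
    return per_step.mean().item()
```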

Experiment Results

The code and dataset are available at https://github.com/yining610/lm_watermarking/tree/main/summarization

I use t5-base and the CNN summarization dataset in the following experiments. All numbers in the tables are computed from 500 examples of the validation set, which is similar to the scale of the original paper.

To reproduce the results, please refer to summarization/README.md.
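For orientation, here is a minimal sketch of the setup; the dataset identifier and prompt format below are assumptions on my part, and the exact configuration is in the README.

```python
# Illustrative setup: t5-base on 500 validation examples of CNN/DailyMail.
# Dataset id and prompt prefix are assumptions; see summarization/README.md for the
# exact configuration used in the experiments.
from datasets import load_dataset
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

val = load_dataset("cnn_dailymail", "3.0.0", split="validation[:500]")

inputs = tokenizer("summarize: " + val[0]["article"], return_tensors="pt", truncation=True)
summary_ids = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```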

Spike Entropy Comparison

| T5-base (raw) / greedy | Generation | Summarization |
| --- | --- | --- |
| Avg Spike Entropy | 0.8976 | 0.8922 |
| ROUGE | 0.0530 | 0.1896 |

| T5-base (trained) / greedy | Generation | Summarization |
| --- | --- | --- |
| Avg Spike Entropy | 0.9414 | 0.8795 |
| ROUGE | 0.0673 | 0.1968 |

| T5-base (trained) / sampling | Generation | Summarization |
| --- | --- | --- |
| Avg Spike Entropy | 0.9458 | 0.8812 |
| ROUGE | 0.0590 | 0.1753 |

Note: The ROUGE scores here are not comparable across tasks, since they are computed against different labels for different tasks. I use the ROUGE score only to guide and monitor training.

The tables above compare the spike entropy of the generation and summarization tasks before and after finetuning. The two tasks have very close average spike entropies with the raw model (\(\Delta = 0.0054\)), while finetuning on the corresponding datasets widens this gap roughly elevenfold (\(\Delta = 0.0619\)). The generation task consistently has a higher average spike entropy than summarization, which partially verifies our first hypothesis: summarization outputs have lower average spike entropy than regular generation.

Watermark Detection Test

Watermark for seq2seq models: We use seeding_scheme=selfhash in our experiments. This scheme requires context_width > 1, which the original WatermarkLogitsProcessor in extended_watermark_processor.py cannot satisfy for seq2seq models (e.g., t5-base in our experiments): with an encoder-decoder architecture, model.generate() only passes decoder_input_ids to the logits-processor interface, so for the first few tokens the length of decoder_input_ids is smaller than context_width and no seed can be generated under this scheme.

To address this problem, I prepend the encoder input ids to the corresponding decoder input ids so that the context is always long enough for seed generation (see the sketch below). However, a more fine-grained seeding solution for seq2seq models is still worth exploring.
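A sketch of this workaround as a thin wrapper around the original processor, assuming the standard Hugging Face LogitsProcessor call signature; this mirrors the idea above and is not necessarily the exact code in the repository.

```python
# Sketch: prepend the encoder input ids to the decoder context so the selfhash scheme
# always sees at least `context_width` tokens. Assumes a single-sequence batch.
import torch
from extended_watermark_processor import WatermarkLogitsProcessor

class Seq2SeqWatermarkLogitsProcessor(WatermarkLogitsProcessor):
    def __init__(self, encoder_input_ids: torch.LongTensor, **kwargs):
        super().__init__(**kwargs)
        self.encoder_input_ids = encoder_input_ids

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor) -> torch.FloatTensor:
        # model.generate() only passes decoder_input_ids for encoder-decoder models,
        # so extend them with the encoder input ids before seeding.
        extended = torch.cat([self.encoder_input_ids.to(input_ids.device), input_ids], dim=-1)
        return super().__call__(extended, scores)
```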

Below are the watermark strengths of the two finetuned task models (TP: the fraction of outputs for which the null hypothesis is rejected because z exceeds the given threshold; FN: the fraction for which it is not).

| \(\delta=2, \gamma=0.25, z=3, \text{sampling}\) | TP | FN | Green Token Ratio | Avg z-score |
| --- | --- | --- | --- | --- |
| Generation | 0.992 | 0.008 | 31% | 0.5528 |
| Summarization | 0.99 | 0.01 | 27.7% | 0.2479 |

| \(\delta=2, \gamma=0.5, z=3, \text{sampling}\) | TP | FN | Green Token Ratio | Avg z-score |
| --- | --- | --- | --- | --- |
| Generation | 1 | 0 | 60% | 0.8071 |
| Summarization | 1 | 0 | 54% | 0.3778 |

Statistically, it is safe to say that summarization makes the watermark more difficult to detect than generation: its green-token ratio is lower and closer to \(\gamma\), which makes the output look more like human-written text. Summarization also yields a lower average z-score, which might be attributed to its shorter outputs. Thus, whether the z-test is a suitable detection technique for summarization becomes a new question to be resolved in the future.
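For context, the detection statistic is \(z = (|s|_G - \gamma T) / \sqrt{T\gamma(1-\gamma)}\), where \(|s|_G\) is the number of green tokens and \(T\) the total token count, so for a fixed green-token ratio \(z\) grows like \(\sqrt{T}\). A small sketch illustrating this length effect:

```python
# z = (n_green - gamma * T) / sqrt(T * gamma * (1 - gamma)): for a fixed green-token
# ratio, the z-score scales with sqrt(T), so shorter summaries score lower.
from math import sqrt

def watermark_z_score(n_green: int, n_tokens: int, gamma: float) -> float:
    return (n_green - gamma * n_tokens) / sqrt(n_tokens * gamma * (1.0 - gamma))

# A 30% green-token ratio with gamma = 0.25 gives z ≈ 1.15 at T = 100 tokens
# but z ≈ 2.31 at T = 400 tokens.
print(watermark_z_score(30, 100, 0.25), watermark_z_score(120, 400, 0.25))
```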

Because of time constraints, I only conducted experiments on a limited set of parameters and models. However, the code can now handle larger-scale experiments, and I would like to run more of them in the near future to strengthen the above claims.

Conclusion

We proposed two hypotheses about watermarking in the summarization task and gave a preliminary verification by comparing against the generation task:

  1. Summarization produces lower spike-entropy text.
  2. Summarization makes watermark detection harder because its green-token ratio is closer to the \(\gamma\) value.

During this process, we also found new challenges in watermarking for summarization, including the suitability of the z-test and the incompatibility of the original watermark processor with seq2seq models, both of which should be explored further in future work.