Watermark Detector and Removal Pipeline

A comprehensive pipeline for detecting and removing AI-generated watermarks from text using multiple detection models, paraphrasing, back-translation, and sentence restructuring techniques.

Features

Multi-Model Watermark Detection: Uses multiple transformer models (T5, GPT-2, Llama-2, Falcon, DeepSeek) to detect watermarks
ML-Based Decision Making: Combines detection results using machine learning for final prediction
Text Paraphrasing: Rewrites text using contextual word embeddings to remove watermarks
Synonym Replacement: Replaces words with synonyms using WordNet for text variation
Back-Translation: Translates text through intermediate languages and back to English for additional transformation
Sentence Restructuring: Restructures sentences by converting between active/passive voice
Automatic File Management: Saves all outputs to the data/ folder
Result Tracking: Saves detection results before and after processing for comparison

Project Structure

watermark-detector-removel/
├── ai_detector/              # Watermark detection modules
│   ├── base.py              # Core detection logic
│   ├── config.py            # Model configurations
│   ├── cal_perc_model.py    # ML decision model
│   ├── model_building.py    # Feature extraction and model building
│   ├── normalizer.py        # Text normalization utilities
│   ├── homoglyphs.py        # Homoglyph detection and handling
│   ├── ensemble_ai_detector.pkl          # Trained ensemble model
│   └── ensemble_ai_detector_scaler.pkl   # Feature scaler
├── paraphraser/              # Text paraphrasing module
│   └── paraphsing.py
├── translator/               # Back-translation module
│   └── translation.py
├── Sentence_Restructuring/  # Sentence restructuring module
│   └── restructuring.py
├── synonym_replacement/      # Synonym replacement module
│   └── synonym_replacement.py
├── data/                     # Input/output files
│   ├── data.txt             # Input text (required)
│   ├── paraphrase.txt       # Paraphrased output (generated)
│   ├── translated.txt       # Back-translated output (generated)
│   ├── restructured.txt     # Restructured output (generated)
│   └── synonym_replaced.txt # Synonym-replaced output (generated)
├── models/                   # Model files (backup location)
│   ├── ensemble_ai_detector.pkl
│   └── ensemble_ai_detector_scaler.pkl
├── results/                  # Detection results
│   ├── before_detection.json    # Detection results before processing
│   ├── after_detection.json     # Detection results after processing
│   └── comparison_summary.json  # Comparison summary
├── notebooks/                # Jupyter notebooks for development
│   ├── ai_detector.ipynb
│   └── paraphrasing.ipynb
├── test/                     # Test files
│   └── test.py
├── main.py                   # Main execution script
├── setup_nltk.py             # NLTK data setup script (run once before main.py)
└── requirements.txt          # Python dependencies

Installation

1. Clone the repository

git clone https://github.com/osman-haider/watermark-detector-removel.git
cd watermark-detector-removel

2. Create a virtual environment

python -m venv venv

Windows:

venv\Scripts\activate

Linux/Mac:

source venv/bin/activate

3. Install dependencies

pip install -r requirements.txt

4. Install spaCy language model

python -m spacy download en_core_web_sm

5. Download NLTK data

Run the setup script to download all required NLTK data packages:

python setup_nltk.py

This will download:

punkt: Sentence tokenizer
punkt_tab: Updated punkt tokenizer data
wordnet: WordNet lexical database for synonym replacement
averaged_perceptron_tagger_eng: Part-of-speech tagger for English

Note: This step is required before running the main application. The script will check if packages are already installed and skip them if found.

Setup

Hugging Face Authentication

This project uses gated models (like meta-llama/Llama-2-7b-hf) that require authentication. You need to set your Hugging Face token as an environment variable.

Getting your Hugging Face Token

1. Go to https://huggingface.co/settings/tokens
2. Create a new token (or use an existing one)
3. Make sure you have access to the gated models you need:
   - Request access at https://huggingface.co/meta-llama/Llama-2-7b-hf

Setting the Token (Windows PowerShell)

Option 1: Set for current session only

$env:HF_TOKEN = "your_token_here"

Option 2: Set permanently (for current user)

[System.Environment]::SetEnvironmentVariable("HF_TOKEN", "your_token_here", "User")

Option 3: Set permanently (system-wide)

[System.Environment]::SetEnvironmentVariable("HF_TOKEN", "your_token_here", "Machine")

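Option 3 must be run from an elevated (administrator) PowerShell. After setting the token permanently, restart your terminal/PowerShell so the new value is picked up.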

Setting the Token (Windows Command Prompt)

Option 1: Set for current session only

set HF_TOKEN=your_token_here

Option 2: Set permanently

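setx HF_TOKEN "your_token_here"

setx only affects Command Prompt windows opened afterwards, so open a new window before running the pipeline.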

Using .env file (Recommended)

Create a .env file in the project root:

HF_TOKEN=your_huggingface_token_here

The code will automatically load the token from the .env file using python-dotenv.

Alternative Environment Variable Names

The code also checks for these environment variable names:

HUGGINGFACE_TOKEN
HUGGING_FACE_HUB_TOKEN
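
The lookup across these variable names can be sketched as follows. This is a minimal illustration, not the project's actual code: the function name `resolve_hf_token` and the order of precedence (HF_TOKEN first) are assumptions, and the .env loading done by python-dotenv is left out so the sketch needs only the standard library.

```python
import os
from typing import Mapping, Optional

# Variable names listed in this README; the order of precedence is an assumption.
TOKEN_VARS = ("HF_TOKEN", "HUGGINGFACE_TOKEN", "HUGGING_FACE_HUB_TOKEN")


def resolve_hf_token(env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return the first non-empty token found, or None if no variable is set."""
    env = os.environ if env is None else env
    for name in TOKEN_VARS:
        value = env.get(name, "").strip()
        if value:
            return value
    return None


if __name__ == "__main__":
    token = resolve_hf_token()
    # Report only whether a token was found, never the token itself.
    print("Hugging Face token found" if token else "No Hugging Face token set")
```

Passing a plain dictionary instead of the real environment makes the lookup easy to check in isolation, for example `resolve_hf_token({"HUGGINGFACE_TOKEN": "abc"})`.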

Verify Token is Set

Windows PowerShell:

echo $env:HF_TOKEN

Windows Command Prompt:

echo %HF_TOKEN%

Linux/Mac:

echo $HF_TOKEN

Usage

Important: Make sure you've completed all setup steps above, including running python setup_nltk.py to download required NLTK data.

1. Prepare Input Text

Place your text in data/data.txt:

# Create the input file (this overwrites any existing data/data.txt)
echo "Your text here" > data/data.txt

2. Run the Pipeline

python main.py

3. Check Outputs

After execution, check the following folders:

data/ folder:

paraphrase.txt - Paraphrased version of the input
synonym_replaced.txt - Synonym-replaced version of the input
translated.txt - Back-translated version (English → Intermediate Language → English)
restructured.txt - Sentence-restructured version

results/ folder:

before_detection.json - Watermark detection results before processing
after_detection.json - Watermark detection results after processing
comparison_summary.json - Summary comparing before and after detection results
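
A quick way to confirm that a run produced everything listed above is a small check script. The sketch below is not part of the repository: the function name `missing_outputs` is hypothetical, and treating an empty file as missing is an assumption about what counts as a successful run.

```python
from pathlib import Path
from typing import List

# File names taken from the data/ and results/ lists above.
EXPECTED_OUTPUTS = (
    "data/paraphrase.txt",
    "data/synonym_replaced.txt",
    "data/translated.txt",
    "data/restructured.txt",
    "results/before_detection.json",
    "results/after_detection.json",
    "results/comparison_summary.json",
)


def missing_outputs(project_root: str = ".") -> List[str]:
    """Return the expected output files that are absent or empty."""
    root = Path(project_root)
    missing = []
    for relative in EXPECTED_OUTPUTS:
        path = root / relative
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(relative)
    return missing


if __name__ == "__main__":
    absent = missing_outputs()
    if absent:
        print("Missing or empty outputs:")
        for name in absent:
            print(f"  {name}")
    else:
        print("All expected outputs are present.")
```

Run it from the project root after python main.py finishes.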

Pipeline Steps

The script processes text through the following steps:

1. Load Input Text - Reads text from data/data.txt
2. Watermark Detection (Before) - Detects watermarks using multiple transformer models
3. ML Decision (Before) - Computes final decision using ML ensemble model
4. Save Initial Detection Results - Saves detection results to results/before_detection.json
5. Paraphrasing - Rewrites text using contextual word embeddings (BERT-based)
6. Save Paraphrased Text - Saves to data/paraphrase.txt
7. Back-Translation - Translates text through an intermediate language and back to English
8. Save Translated Text - Saves to data/translated.txt
9. Synonym Replacement - Replaces words with synonyms using WordNet
10. Save Synonym Replaced Text - Saves to data/synonym_replaced.txt
11. Sentence Restructuring - Restructures sentences by converting between active/passive voice
12. Save Restructured Text - Saves to data/restructured.txt
13. Watermark Detection (After) - Detects watermarks in the final processed text
14. ML Decision (After) - Computes final decision for processed text
15. Save Final Detection Results - Saves detection results to results/after_detection.json
16. Comparison Analysis - Compares before and after detection results
17. Save Comparison Summary - Saves comparison to results/comparison_summary.json
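
The transform-then-save pattern in the middle of this sequence can be sketched as a simple loop over stages. This is an illustration only and does not reproduce main.py: the name `run_stages` is hypothetical, the stages shown are placeholders that leave the text unchanged, and feeding each stage the previous stage's output is an assumption based on the reference to "the final processed text".

```python
from pathlib import Path
from typing import Callable, List, Tuple

# A stage is (output file name, transformation). Any callable str -> str fits;
# the project's real transformations live in its own modules.
Stage = Tuple[str, Callable[[str], str]]


def run_stages(text: str, stages: List[Stage], data_dir: str = "data") -> str:
    """Apply each stage in order and save its output, overwriting old files."""
    out_dir = Path(data_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for filename, transform in stages:
        text = transform(text)  # assumption: each stage feeds the next one
        (out_dir / filename).write_text(text, encoding="utf-8")
    return text


if __name__ == "__main__":
    import tempfile

    # Placeholder stages (hypothetical stand-ins that return the text as-is).
    placeholder_stages: List[Stage] = [
        ("paraphrase.txt", lambda t: t),
        ("translated.txt", lambda t: t),
        ("synonym_replaced.txt", lambda t: t),
        ("restructured.txt", lambda t: t),
    ]
    with tempfile.TemporaryDirectory() as tmp:
        run_stages("Example text.", placeholder_stages, data_dir=tmp)
        print(sorted(p.name for p in Path(tmp).iterdir()))
```

Keeping each stage as a plain function from string to string means a stage can be reordered, skipped, or tested on its own without touching the loop.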

Notes

The translation module processes text sentence-by-sentence to handle long texts efficiently
All outputs overwrite previous files on each run
The pipeline includes comprehensive error handling and progress logging
Make sure you have internet access for downloading models and using translation APIs
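
Sentence-by-sentence processing, as mentioned in the first note, can be sketched like this. It is a simplified stand-in rather than the translator module's code: the regex splitter is an assumption (the project may well rely on NLTK's punkt tokenizer instead), and both function names are hypothetical.

```python
import re
from typing import Callable, List


def split_sentences(text: str) -> List[str]:
    """Split on whitespace that follows '.', '!' or '?' (a rough heuristic)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def transform_by_sentence(text: str, transform: Callable[[str], str]) -> str:
    """Apply `transform` to each sentence; keep the original sentence on failure."""
    results = []
    for sentence in split_sentences(text):
        try:
            results.append(transform(sentence))
        except Exception:
            # One failed request should not discard the rest of the text.
            results.append(sentence)
    return " ".join(results)


if __name__ == "__main__":
    # str.upper stands in for a real per-sentence transformation.
    print(transform_by_sentence("First sentence. Second one! A third?", str.upper))
```

Working on short units keeps each request small, and falling back to the original sentence means a single network error leaves a gap in the transformation instead of aborting the whole text.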

Troubleshooting

Error: "Cannot access gated repo"

Make sure your Hugging Face token is set correctly
Request access to gated models at their respective Hugging Face pages
Verify that the token has read permissions

Error: "spacy model not found"

Run: python -m spacy download en_core_web_sm

Error: "NLTK data not found" or "LookupError: punkt"

Run: python setup_nltk.py to download all required NLTK data packages
Make sure you have internet access during the setup

Translation errors with long text

The script automatically handles long texts by processing sentence-by-sentence
If errors persist, check your internet connection and Google Translate API availability
