A comprehensive pipeline for detecting and removing AI-generated watermarks from text using multiple detection models, paraphrasing, back-translation, and sentence restructuring techniques.
- Multi-Model Watermark Detection: Uses multiple transformer models (T5, GPT-2, Llama-2, Falcon, DeepSeek) to detect watermarks
- ML-Based Decision Making: Combines detection results using machine learning for final prediction
- Text Paraphrasing: Rewrites text using contextual word embeddings to remove watermarks
- Synonym Replacement: Replaces words with synonyms using WordNet for text variation
- Back-Translation: Translates text through intermediate languages and back to English for additional transformation
- Sentence Restructuring: Restructures sentences by converting between active/passive voice
- Automatic File Management: Saves all outputs to the data/ folder
- Result Tracking: Saves detection results before and after processing for comparison
watermark-detector-removel/
├── ai_detector/ # Watermark detection modules
│ ├── base.py # Core detection logic
│ ├── config.py # Model configurations
│ ├── cal_perc_model.py # ML decision model
│ ├── model_building.py # Feature extraction and model building
│ ├── normalizer.py # Text normalization utilities
│ ├── homoglyphs.py # Homoglyph detection and handling
│ ├── ensemble_ai_detector.pkl # Trained ensemble model
│ └── ensemble_ai_detector_scaler.pkl # Feature scaler
├── paraphraser/ # Text paraphrasing module
│ └── paraphsing.py
├── translator/ # Back-translation module
│ └── translation.py
├── Sentence_Restructuring/ # Sentence restructuring module
│ └── restructuring.py
├── synonym_replacement/ # Synonym replacement module
│ └── synonym_replacement.py
├── data/ # Input/output files
│ ├── data.txt # Input text (required)
│ ├── paraphrase.txt # Paraphrased output (generated)
│ ├── translated.txt # Back-translated output (generated)
│ ├── restructured.txt # Restructured output (generated)
│ └── synonym_replaced.txt # Synonym-replaced output (generated)
├── models/ # Model files (backup location)
│ ├── ensemble_ai_detector.pkl
│ └── ensemble_ai_detector_scaler.pkl
├── results/ # Detection results
│ ├── before_detection.json # Detection results before processing
│ ├── after_detection.json # Detection results after processing
│ └── comparison_summary.json # Comparison summary
├── notebooks/ # Jupyter notebooks for development
│ ├── ai_detector.ipynb
│ └── paraphrasing.ipynb
├── test/ # Test files
│ └── test.py
├── main.py # Main execution script
├── setup_nltk.py # NLTK data setup script (run once before main.py)
└── requirements.txt # Python dependencies
git clone https://github.com/osman-haider/watermark-detector-removel.git
cd watermark-detector-removel
python -m venv venv
Windows:
venv\Scripts\activate
Linux/Mac:
source venv/bin/activate
pip install -r requirements.txt
python -m spacy download en_core_web_sm
Run the setup script to download all required NLTK data packages:
python setup_nltk.py
This will download:
- punkt: Sentence tokenizer
- punkt_tab: Updated punkt tokenizer data
- wordnet: WordNet lexical database for synonym replacement
- averaged_perceptron_tagger_eng: Part-of-speech tagger for English
Note: This step is required before running the main application. The script will check if packages are already installed and skip them if found.
This project uses gated models (like meta-llama/Llama-2-7b-hf) that require authentication. You need to set your Hugging Face token as an environment variable.
- Go to https://huggingface.co/settings/tokens
- Create a new token (or use an existing one)
- Make sure you have access to the gated models you need:
- Request access at https://huggingface.co/meta-llama/Llama-2-7b-hf
In PowerShell:
Option 1: Set for current session only
$env:HF_TOKEN = "your_token_here"
Option 2: Set permanently (for current user)
[System.Environment]::SetEnvironmentVariable("HF_TOKEN", "your_token_here", "User")
Option 3: Set permanently (system-wide)
[System.Environment]::SetEnvironmentVariable("HF_TOKEN", "your_token_here", "Machine")
After setting permanently, restart your terminal/PowerShell.
In Command Prompt:
Option 1: Set for current session only
set HF_TOKEN=your_token_here
Option 2: Set permanently
setx HF_TOKEN "your_token_here"
Create a .env file in the project root:
HF_TOKEN=your_huggingface_token_here
The code will automatically load the token from the .env file using python-dotenv.
The code also checks for these environment variable names:
- HUGGINGFACE_TOKEN
- HUGGING_FACE_HUB_TOKEN
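As an illustration of the lookup described above, here is a minimal sketch of resolving the token from the three variable names. The function name `resolve_hf_token` and the precedence order are assumptions made for this example; the project's actual code may check the names in a different order, and it loads the .env file through python-dotenv first, which is omitted here.

```python
import os
from typing import Mapping, Optional

# Variable names listed in this README; the lookup order is an assumption.
TOKEN_ENV_NAMES = ("HF_TOKEN", "HUGGINGFACE_TOKEN", "HUGGING_FACE_HUB_TOKEN")


def resolve_hf_token(env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return the first non-empty token found, or None if none is set."""
    env = os.environ if env is None else env
    for name in TOKEN_ENV_NAMES:
        value = env.get(name, "").strip()
        if value:
            return value
    return None
```

Calling `resolve_hf_token()` with no arguments reads the real environment, which makes it a quick way to confirm that Python sees the token you set.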
echo $env:HF_TOKEN
Important: Make sure you've completed all setup steps above, including running python setup_nltk.py to download required NLTK data.
Place your text in data/data.txt:
# Create the file if it doesn't exist
echo "Your text here" > data/data.txt
python main.py
After execution, check the following folders:
data/ folder:
- paraphrase.txt - Paraphrased version of the input
- synonym_replaced.txt - Synonym-replaced version of the input
- translated.txt - Back-translated version (English → Intermediate Language → English)
- restructured.txt - Sentence-restructured version
results/ folder:
- before_detection.json - Watermark detection results before processing
- after_detection.json - Watermark detection results after processing
- comparison_summary.json - Summary comparing before and after detection results
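For readers who want to compare the two detection files themselves, the following is a rough sketch only. The JSON schema is not documented here, so the key `ai_probability` is a hypothetical placeholder, as are the function names; inspect your own before_detection.json and substitute the real field.

```python
import json
from pathlib import Path

# Hypothetical key: the real JSON schema is not documented in this README.
SCORE_KEY = "ai_probability"


def summarize(before: dict, after: dict, key: str = SCORE_KEY) -> dict:
    """Compare one numeric score taken from two detection results."""
    b, a = float(before[key]), float(after[key])
    return {
        "before": b,
        "after": a,
        "change": round(a - b, 4),
        "score_decreased": a < b,
    }


def summarize_files(results_dir: str = "results") -> dict:
    """Load both result files from results_dir and compare them."""
    root = Path(results_dir)
    before = json.loads((root / "before_detection.json").read_text(encoding="utf-8"))
    after = json.loads((root / "after_detection.json").read_text(encoding="utf-8"))
    return summarize(before, after)
```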
The script processes text through the following steps:
- Load Input Text - Reads text from data/data.txt
- Watermark Detection (Before) - Detects watermarks using multiple transformer models
- ML Decision (Before) - Computes final decision using ML ensemble model
- Save Initial Detection Results - Saves detection results to results/before_detection.json
- Paraphrasing - Rewrites text using contextual word embeddings (BERT-based)
- Save Paraphrased Text - Saves to data/paraphrase.txt
- Back-Translation - Translates text through an intermediate language and back to English
- Save Translated Text - Saves to data/translated.txt
- Synonym Replacement - Replaces words with synonyms using WordNet
- Save Synonym Replaced Text - Saves to data/synonym_replaced.txt
- Sentence Restructuring - Restructures sentences by converting between active/passive voice
- Save Restructured Text - Saves to data/restructured.txt
- Watermark Detection (After) - Detects watermarks in the final processed text
- ML Decision (After) - Computes final decision for processed text
- Save Final Detection Results - Saves detection results to results/after_detection.json
- Comparison Analysis - Compares before and after detection results
- Save Comparison Summary - Saves comparison to results/comparison_summary.json
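The transformation steps in this list can be pictured as a chain of text-to-text functions, each saving its output before the next one runs. The sketch below is not the code in main.py: `run_stages` and the `Stage` alias are hypothetical names, and it assumes each stage receives the previous stage's output, which the step list suggests but does not state outright.

```python
from pathlib import Path
from typing import Callable, Dict, List, Tuple

# (output file name, transform function); both are supplied by the caller.
Stage = Tuple[str, Callable[[str], str]]


def run_stages(text: str, stages: List[Stage], out_dir: str = "data") -> Dict[str, str]:
    """Apply each stage to the previous stage's output and save every result."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved: Dict[str, str] = {}
    for filename, transform in stages:
        text = transform(text)
        # write_text replaces any file left over from an earlier run.
        (out / filename).write_text(text, encoding="utf-8")
        saved[filename] = text
    return saved
```

With the real modules, the stage list would pair paraphrase.txt, translated.txt, synonym_replaced.txt, and restructured.txt with the corresponding functions from paraphraser/, translator/, synonym_replacement/, and Sentence_Restructuring/.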
- The translation module processes text sentence-by-sentence to handle long texts efficiently
- All outputs overwrite previous files on each run
- The pipeline includes comprehensive error handling and progress logging
- Make sure you have internet access for downloading models and using translation APIs
- Make sure your Hugging Face token is set correctly
- Request access to gated models at their respective Hugging Face pages
- Verify token has read permissions
- Run: python -m spacy download en_core_web_sm
- Run: python setup_nltk.py to download all required NLTK data packages
- Make sure you have internet access during the setup
- The script automatically handles long texts by processing sentence-by-sentence
- If errors persist, check your internet connection and Google Translate API availability
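The sentence-by-sentence handling mentioned above can be sketched as follows. This is an illustration, not the code in translator/translation.py: the regex splitter stands in for a proper sentence tokenizer, the two translation callables are hypothetical parameters you would wire to your translation client, and falling back to the original sentence on failure is an assumption about sensible behavior.

```python
import re
from typing import Callable, List


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def back_translate(
    text: str,
    to_pivot: Callable[[str], str],
    to_english: Callable[[str], str],
) -> str:
    """Round-trip each sentence through a pivot language and rejoin the results."""
    out: List[str] = []
    for sentence in split_sentences(text):
        try:
            out.append(to_english(to_pivot(sentence)))
        except Exception:
            # Assumed fallback: keep the original sentence if a call fails.
            out.append(sentence)
    return " ".join(out)
```

Working on one sentence at a time keeps each request small, so a single failed call (for example, a dropped connection) affects only that sentence rather than the whole document.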