# check_pa
[Python tests](https://git.ccc-p.org/cyroxx/check_pa/actions/workflows/python-tests.yml)
Monitor the Berlin Perso/Passport portal, crack the audio CAPTCHA with Whisper, and notify a webhook when fresh status information drops.
## Features
- Automates Firefox with Selenium to reach the Berlin appointment status page
- Downloads and transcribes audio CAPTCHAs via Whisper, falling back between attempts
- Normalizes the returned status text and emits structured data to any HTTP webhook
- Includes tooling for collecting captcha samples and benchmarking transcription quality
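The status normalization mentioned above can be sketched roughly as follows. The function name and the exact rules are assumptions for illustration; the project's real parsing logic lives in its own modules and may differ.

```python
import re

# Hypothetical sketch of status-text normalization; check_pa's actual
# rules are defined in its parsing code and may differ.
def normalize_status(raw: str) -> str:
    # Collapse runs of whitespace and trim the surrounding padding so the
    # webhook always receives a single clean line of text.
    return re.sub(r"\s+", " ", raw).strip()
```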
## Requirements
- Python 3.12
- Firefox + `geckodriver` in `$PATH` for Selenium
- `ffmpeg` (needed by `openai-whisper`)
- Optional: Tesseract OCR if you experiment with the image-based approach in `ocr/`
- Optional: Python packages from `requirements-ocr.txt` when working on the OCR experiments
## Setup
1. Clone the repo and create a virtual environment: `python -m venv .venv && source .venv/bin/activate`
2. Install runtime dependencies: `pip install -r requirements.txt`
3. (Optional) Install OCR extras: `pip install -r requirements-ocr.txt`
4. (Optional) Add tooling such as pytest: `pip install -r dev-requirements.txt`
5. Provide credentials:
   - Copy `settings.example.py` to `settings.py`
   - Set `DOCUMENT_ID` (the identifier embedded in the Berlin status URL)
   - Set `WEBHOOK_URL` pointing to the service that should receive status payloads
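A minimal `settings.py` might look like the following. Both values are placeholders: the real `DOCUMENT_ID` comes from your own Berlin status URL, and `WEBHOOK_URL` must point at an endpoint you control.

```python
# settings.py — placeholder values only; copy settings.example.py and
# fill in your own document ID and webhook endpoint.
DOCUMENT_ID = "0000000000"  # placeholder: the ID embedded in your status URL
WEBHOOK_URL = "https://example.com/check_pa-hook"  # placeholder endpoint
```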
## Usage
Run `python main.py` to start a polling cycle. The script will:
1. Launch Firefox (set `USE_HEADLESS_MODE = True` in `main.py` for CI/servers)
2. Download the audio CAPTCHA into `audio_captchas/`
3. Transcribe it with Whisper via `transcription.py`
4. Parse the resulting status page and post `{status, last_updated}` to your webhook
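Step 4's webhook delivery can be sketched like this. The helper names are hypothetical (not part of check_pa), and the real payload may carry additional fields beyond `{status, last_updated}`.

```python
import json
import urllib.request

# Hypothetical helpers for the webhook step; check_pa's actual
# implementation may differ in names and payload shape.
def build_payload(status: str, last_updated: str) -> bytes:
    # Encode the {status, last_updated} document as JSON bytes for the POST body.
    return json.dumps({"status": status, "last_updated": last_updated}).encode()

def post_status(webhook_url: str, status: str, last_updated: str) -> int:
    req = urllib.request.Request(
        webhook_url,
        data=build_payload(status, last_updated),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    # Network call; requires a live endpoint such as the one in WEBHOOK_URL.
    with urllib.request.urlopen(req) as resp:
        return resp.status
```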
Helpful utilities:
- `test_transcription()` inside `main.py` evaluates every mp3 in `audio_captchas/` and writes `transcription_results.csv`
- `test_parse_status_page()` parses the fixtures in `sample_html/` to validate the BeautifulSoup logic
- `ocr/recognize*.py` contains earlier OCR experiments for the visual CAPTCHA
## Testing
- Install dev tooling: `pip install -r dev-requirements.txt`
- Run `pytest`
## Data & Artifacts
- `audio_captchas/` collects downloaded mp3 files for debugging/benchmarking
- `captchas/` and `ocr/` scripts help capture and label the image CAPTCHAs
- `sample_html/` and `samples/` host anonymized HTML snapshots used for parsing tests
## Troubleshooting
- Whisper may need the `ffmpeg` binary; ensure `ffmpeg -version` works inside the venv
- If Selenium cannot start, verify `geckodriver --version` is available and matches your Firefox version
- For webhook issues, point `WEBHOOK_URL` at a local listener such as `nc -l 8080`, or use a relay like `smee.io`, to inspect outbound payloads
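The first two checks above can be automated with a small preflight helper. The function name and the default binary list are my own, not part of check_pa:

```python
import shutil

# Report which of the external binaries check_pa relies on are missing
# from $PATH (names assumed from this README's requirements section).
def missing_binaries(required=("ffmpeg", "geckodriver", "firefox")):
    return [name for name in required if shutil.which(name) is None]

if __name__ == "__main__":
    missing = missing_binaries()
    print("all prerequisites found" if not missing else f"missing: {missing}")
```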