Fixes uchicago-dsi/core-facility#103 (out-of-date scrapers) #24

Open
jpivarski wants to merge 6 commits into main from jpivarski-fix-afdb-idb-dfc-proparco

@jpivarski commented Mar 4, 2026

Fixes uchicago-dsi/core-facility#103

I've been collecting logs of test commands and their output to show

  • what was broken before the fix
  • that the same tests pass after the fix

but they were too large to put in one comment box, so I split them into a series of comments below.

Worth noting: IDB has self-healed since it was last tested.

@jpivarski

AFDB

docker run --rm -v "$PWD/services/extract/src:/app/src" --env-file "$PWD/services/extract/.env.dev" debit-scrapers bash -lc "cd /app/src/pipeline && uv run pytest ./extract/tests/integration/banks.py::TestWorkflows -k afdb -vv"

Before fix

Command output
warning: The `tool.uv.dev-dependencies` field (used in `/app/pyproject.toml`) is deprecated and will be removed in a future release; use `dependency-groups.dev` instead
Installed 14 packages in 3.62s
============================= test session starts ==============================
platform linux -- Python 3.11.0rc1, pytest-8.4.1, pluggy-1.6.0 -- /app/.venv/bin/python3
cachedir: .pytest_cache
django: version: 5.2.5, settings: config.settings (from env)
rootdir: /app
configfile: pyproject.toml
plugins: anyio-4.10.0, django-4.11.1
collecting ... collected 81 items / 79 deselected / 2 selected

extract/tests/integration/banks.py::TestWorkflows::test_partial_download[afdb] FAILED [ 50%]
extract/tests/integration/banks.py::TestWorkflows::test_project_partial_page_scrape[afdb-https://mapafrica.afdb.org/api/v13/activities/46002-P-MZ-AA0-045/organisations] PASSED [100%]

=================================== FAILURES ===================================
__________________ TestWorkflows.test_partial_download[afdb] ___________________

self = <pipeline.extract.tests.integration.banks.TestWorkflows object at 0x7fa5ea964f50>
bank = 'afdb'
data_request_client = <common.http.DataRequestClient object at 0x7fa5ea8e8910>

    def test_partial_download(
        self, bank: str, data_request_client: DataRequestClient
    ) -> None:
        """Test of the `ProjectPartialDownloadWorkflow`.
    
        Asserts that downloading a data file from
        a URL does not result in an exception.
    
        Args:
            bank: The abbreviation for the bank or
                financial institution (e.g., "AFDB").
    
            data_request_client: An instance of a client used to make HTTP
                requests while rotating headers.
    
        Returns:
            `None`
        """
        workflow: ProjectPartialDownloadWorkflow = WorkflowClassRegistry.get(
            source=bank,
            workflow_type=settings.PROJECT_PARTIAL_DOWNLOAD_WORKFLOW,
            data_request_client=data_request_client,
            msg_queue_client=None,
            db_client=None,
        )
    
>       raw_projects = workflow.get_projects()
                       ^^^^^^^^^^^^^^^^^^^^^^^

extract/tests/integration/banks.py:183: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
extract/workflows/banks/afdb.py:211: in get_projects
    file_name = self._wait_for_download(session, download_id)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <extract.workflows.banks.afdb.AfdbProjectPartialDownloadWorkflow object at 0x7fa5ea8e8d90>
session = <curl_cffi.requests.session.Session object at 0x7fa5ea8e8d10>
download_id = 'ea734ed5-085a-4e3e-95bf-027e586f885f', max_checks = 3

    def _wait_for_download(
        self, session: requests.Session, download_id: str, max_checks: int = 3
    ) -> str:
        """Waits for a download to complete.
    
        Raises:
            RuntimeError: If the download fails, is not completed
                after the maximum number of checks has been reached,
                or has a response body that cannot be parsed.
    
        Args:
            session: A requests session.
    
            download_id: The download id.
    
            max_checks: The maximum number of times to check the download status.
                Defaults to 3.
    
        Returns:
            The name of the file to download.
        """
        # Wait until download is complete
        num_checks = 0
        while True:
            # Check download status
            r = session.get(
                f"https://mapafrica.afdb.org/api/v14/downloads/download/{download_id}",
                impersonate="chrome110",
                timeout=60,
            )
    
            # Raise exception if error occurred
            if not r.status_code:
                raise RuntimeError(
                    "Error downloading data. The request to "
                    "check the status of the data download "
                    f"failed with a status code of "
                    f'"{r.status_code} - {r.reason}".'
                )
    
            # Parse response body
            try:
                payload = r.json()
                if payload["state"] == "SUCCESS":
                    return payload["file"]
            except (json.JsonDecodeError, KeyError):
                raise RuntimeError(
                    "Error downloading data. The request to "
                    "check the status of the data download "
                    f"did not return the expected response "
                    f"payload: {r.text}."
                ) from None
    
            # Wait before checking status again
            time.sleep(10)
    
            # Iterate number of times status has been checked
            num_checks += 1
    
            # Raise exception if download has not completed
            if num_checks >= max_checks:
>               raise RuntimeError(
                    "Error downloading data. The data download "
                    f"has not completed after {max_checks} attempts."
                )
E               RuntimeError: Error downloading data. The data download has not completed after 3 attempts.

extract/workflows/banks/afdb.py:147: RuntimeError
=============================== warnings summary ===============================
../../.venv/lib/python3.11/site-packages/pyppeteer/us_keyboard_layout.py:73
  /app/.venv/lib/python3.11/site-packages/pyppeteer/us_keyboard_layout.py:73: DeprecationWarning: invalid escape sequence '\('
    'Digit9': {'keyCode': 57, 'code': 'Digit9', 'shiftKey': '\(', 'key': '9'},

../../.venv/lib/python3.11/site-packages/pyppeteer/us_keyboard_layout.py:143
  /app/.venv/lib/python3.11/site-packages/pyppeteer/us_keyboard_layout.py:143: DeprecationWarning: invalid escape sequence '\<'
    'Comma': {'keyCode': 188, 'code': 'Comma', 'shiftKey': '\<', 'key': ','},

../../.venv/lib/python3.11/site-packages/pyppeteer/us_keyboard_layout.py:247
  /app/.venv/lib/python3.11/site-packages/pyppeteer/us_keyboard_layout.py:247: DeprecationWarning: invalid escape sequence '\<'
    '<': {'keyCode': 188, 'key': '\<', 'code': 'Comma'},

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
=========================== short test summary info ============================
FAILED extract/tests/integration/banks.py::TestWorkflows::test_partial_download[afdb] - RuntimeError: Error downloading data. The data download has not completed after 3 attempts.
=========== 1 failed, 1 passed, 79 deselected, 3 warnings in 38.36s ============

After fix

warning: The `tool.uv.dev-dependencies` field (used in `/app/pyproject.toml`) is deprecated and will be removed in a future release; use `dependency-groups.dev` instead
Installed 14 packages in 3.61s
============================= test session starts ==============================
platform linux -- Python 3.11.0rc1, pytest-8.4.1, pluggy-1.6.0 -- /app/.venv/bin/python3
cachedir: .pytest_cache
django: version: 5.2.5, settings: config.settings (from env)
rootdir: /app
configfile: pyproject.toml
plugins: anyio-4.10.0, django-4.11.1
collecting ... collected 81 items / 79 deselected / 2 selected

extract/tests/integration/banks.py::TestWorkflows::test_partial_download[afdb] PASSED [ 50%]
extract/tests/integration/banks.py::TestWorkflows::test_project_partial_page_scrape[afdb-https://mapafrica.afdb.org/api/v13/activities/46002-P-MZ-AA0-045/organisations] PASSED [100%]

=============================== warnings summary ===============================
../../.venv/lib/python3.11/site-packages/pyppeteer/us_keyboard_layout.py:73
  /app/.venv/lib/python3.11/site-packages/pyppeteer/us_keyboard_layout.py:73: DeprecationWarning: invalid escape sequence '\('
    'Digit9': {'keyCode': 57, 'code': 'Digit9', 'shiftKey': '\(', 'key': '9'},

../../.venv/lib/python3.11/site-packages/pyppeteer/us_keyboard_layout.py:143
  /app/.venv/lib/python3.11/site-packages/pyppeteer/us_keyboard_layout.py:143: DeprecationWarning: invalid escape sequence '\<'
    'Comma': {'keyCode': 188, 'code': 'Comma', 'shiftKey': '\<', 'key': ','},

../../.venv/lib/python3.11/site-packages/pyppeteer/us_keyboard_layout.py:247
  /app/.venv/lib/python3.11/site-packages/pyppeteer/us_keyboard_layout.py:247: DeprecationWarning: invalid escape sequence '\<'
    '<': {'keyCode': 188, 'key': '\<', 'code': 'Comma'},

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
================ 2 passed, 79 deselected, 3 warnings in 30.88s =================

@jpivarski

IDB

docker run --rm -v "$PWD/services/extract/src:/app/src" --env-file "$PWD/services/extract/.env.dev" debit-scrapers bash -lc "cd /app/src/pipeline && uv run pytest ./extract/tests/integration/banks.py::TestWorkflows -k idb -vv"

No changes

Although the issue says that there was an error fetching an authentication token, I didn't see that. Without making any changes, all tests pass.

Command output
warning: The `tool.uv.dev-dependencies` field (used in `/app/pyproject.toml`) is deprecated and will be removed in a future release; use `dependency-groups.dev` instead
Installed 14 packages in 3.60s
============================= test session starts ==============================
platform linux -- Python 3.11.0rc1, pytest-8.4.1, pluggy-1.6.0 -- /app/.venv/bin/python3
cachedir: .pytest_cache
django: version: 5.2.5, settings: config.settings (from env)
rootdir: /app
configfile: pyproject.toml
plugins: anyio-4.10.0, django-4.11.1
collecting ... collected 81 items / 74 deselected / 7 selected

extract/tests/integration/banks.py::TestWorkflows::test_partial_download[idb] PASSED [ 14%]
extract/tests/integration/banks.py::TestWorkflows::test_project_partial_page_scrape[idb-https://www.iadb.org/en/project/BO0060] PASSED [ 28%]
extract/tests/integration/banks.py::TestWorkflows::test_project_partial_page_scrape[idb-https://www.iadb.org/en/project/CO-T1792] PASSED [ 42%]
extract/tests/integration/banks.py::TestWorkflows::test_project_partial_page_scrape[idb-https://www.iadb.org/en/project/RG-Q0153] PASSED [ 57%]
extract/tests/integration/banks.py::TestWorkflows::test_result_page_scrape[idb-https://www.iadb.org/en/project-search?page=0] PASSED [ 71%]
extract/tests/integration/banks.py::TestWorkflows::test_result_page_scrape[idb-https://www.iadb.org/en/project-search?page=400] PASSED [ 85%]
extract/tests/integration/banks.py::TestWorkflows::test_result_page_scrape[idb-https://www.iadb.org/en/project-search?page=997] PASSED [100%]

=============================== warnings summary ===============================
../../.venv/lib/python3.11/site-packages/pyppeteer/us_keyboard_layout.py:73
  /app/.venv/lib/python3.11/site-packages/pyppeteer/us_keyboard_layout.py:73: DeprecationWarning: invalid escape sequence '\('
    'Digit9': {'keyCode': 57, 'code': 'Digit9', 'shiftKey': '\(', 'key': '9'},

../../.venv/lib/python3.11/site-packages/pyppeteer/us_keyboard_layout.py:143
  /app/.venv/lib/python3.11/site-packages/pyppeteer/us_keyboard_layout.py:143: DeprecationWarning: invalid escape sequence '\<'
    'Comma': {'keyCode': 188, 'code': 'Comma', 'shiftKey': '\<', 'key': ','},

../../.venv/lib/python3.11/site-packages/pyppeteer/us_keyboard_layout.py:247
  /app/.venv/lib/python3.11/site-packages/pyppeteer/us_keyboard_layout.py:247: DeprecationWarning: invalid escape sequence '\<'
    '<': {'keyCode': 188, 'key': '\<', 'code': 'Comma'},

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
================ 7 passed, 74 deselected, 3 warnings in 52.56s =================

@jpivarski

DFC

docker run --rm -v "$PWD/services/extract/src:/app/src" --env-file "$PWD/services/extract/.env.dev" debit-scrapers bash -lc "cd /app/src/pipeline && uv run pytest ./extract/tests/integration/banks.py::TestWorkflows::test_download -k dfc -vv"

Before fix

Although the issue says that this one was already fixed, I had to allow for a space in the Fiscal Year column name.
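
For illustration, here is a minimal sketch of one way to tolerate both spellings of the column name. It is not the code in this PR: `find_column` is a hypothetical helper, and the column list is copied from the DataFrame summary in the failing run in this comment.

```python
def find_column(columns, wanted):
    """Return the actual column name matching `wanted`.

    Hypothetical helper: matching ignores case and treats spaces and
    underscores as equivalent, so "Fiscal_Year" finds "Fiscal Year".
    """

    def normalize(name):
        return str(name).strip().lower().replace("_", " ")

    target = normalize(wanted)
    matches = [col for col in columns if normalize(col) == target]
    if len(matches) != 1:
        raise KeyError(
            f"Expected exactly one column like {wanted!r}, found {matches}"
        )
    return matches[0]


# Column names taken from the DataFrame summary in the failing run.
columns = ["Fiscal Year", "Project Number", "URL Frequency", "countries"]
fiscal_year_column = find_column(columns, "Fiscal_Year")
print(fiscal_year_column)
```

With a pandas DataFrame, the same helper could be applied as `df[find_column(df.columns, "Fiscal_Year")]`, which keeps `clean_projects` working whether the source file uses a space or an underscore.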

Command output
warning: The `tool.uv.dev-dependencies` field (used in `/app/pyproject.toml`) is deprecated and will be removed in a future release; use `dependency-groups.dev` instead
Installed 14 packages in 3.63s
============================= test session starts ==============================
platform linux -- Python 3.11.0rc1, pytest-8.4.1, pluggy-1.6.0 -- /app/.venv/bin/python3
cachedir: .pytest_cache
django: version: 5.2.5, settings: config.settings (from env)
rootdir: /app
configfile: pyproject.toml
plugins: anyio-4.10.0, django-4.11.1
collecting ... collected 4 items / 3 deselected / 1 selected

extract/tests/integration/banks.py::TestWorkflows::test_download[dfc] FAILED [100%]

=================================== FAILURES ===================================
_______________________ TestWorkflows.test_download[dfc] _______________________

self = <pipeline.extract.tests.integration.banks.TestWorkflows object at 0x721b05808c90>
bank = 'dfc'
data_request_client = <common.http.DataRequestClient object at 0x721b054f8450>

    def test_download(self, bank: str, data_request_client: DataRequestClient) -> None:
        """Test of the `ProjectDownloadWorkflow`.
    
        Asserts that downloading a data file from
        a URL does not result in an exception.
    
        Args:
            bank: The abbreviation for the bank or
                financial institution (e.g., "AFDB").
    
            data_request_client: An instance of a client used to make HTTP
                requests while rotating headers.
    
        Returns:
            `None`
        """
        workflow: ProjectDownloadWorkflow = WorkflowClassRegistry.get(
            source=bank,
            workflow_type=settings.PROJECT_DOWNLOAD_WORKFLOW,
            data_request_client=data_request_client,
            msg_queue_client=None,
            db_client=None,
        )
    
        raw_projects = workflow.get_projects()
        raw_projects.to_csv(
            settings.EXTRACT_TEST_RESULT_DIR / f"{bank}_raw_projects.csv",
            index=False,
        )
    
>       clean_projects = workflow.clean_projects(raw_projects)
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

extract/tests/integration/banks.py:151: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <extract.workflows.banks.dfc.DfcDownloadWorkflow object at 0x721b055db6d0>
df =      Fiscal Year Project Number  ... URL Frequency                     countries
0         2014.0     9000003464  ... ...        Kenya
688       2025.0     9100000094  ...             1  Democratic Republic Of Congo

[689 rows x 14 columns]

    def clean_projects(self, df: pd.DataFrame) -> pd.DataFrame:
        """Cleans DFC project records to conform to an expected schema.
    
        NOTE: Unique DFC projects can point to the same project page URL
        if they are related, which breaks current database constraints.
        To work around this issue, a faux anchor link with the project
        number is appended to the URL whenever more than one reference to
        that URL exists (e.g., "https://www.dfc.gov/sites/default/files/media/documents/9000115501_0.pdf#9000115550").
    
        References:
        - https://www.dfc.gov/what-we-do/active-projects
        - https://www.dfc.gov/our-impact/transaction-data
    
        Args:
            df: The raw project records.
    
        Returns:
            The cleaned records.
        """
        try:
            # Drop records w/o project page URLs (public information sheets)
            df = df.query("`Project Profile URL` == `Project Profile URL`")
    
            # Aggregate project commitments by project number
            agg_map = {
                col: "sum" if col == "Committed" else "first" for col in df.columns
            }
            df = df.groupby("Project Number").agg(agg_map).reset_index(drop=True)
    
            # Count frequency of project page URLs
            url_frequencies_df = (
                df[["Project Profile URL"]]
                .groupby("Project Profile URL")
                .size()
                .reset_index()
                .rename(columns={0: "URL Frequency"})
            )
    
            # Merge frequencies with existing data
            df = df.merge(url_frequencies_df, on="Project Profile URL", how="left")
    
            # Create output columns
            df["countries"] = df["Country"]
            df["fiscal_year_effective"] = df["Fiscal_Year"].astype(int)
            df["finance_types"] = df["Project Type"]
            df["name"] = df["Project Name"]
            df["number"] = df["Project Number"]
            df["sectors"] = df["NAICS Sector"]
            df["source"] = settings.DFC_ABBREVIATION.upper()
            df["total_amount"] = df["total_amount_usd"] = df["Committed"]
            df["total_amount_currency"] = "USD"
            df["url"] = df.apply(
                lambda row: row["Project Profile URL"]
                + ("" if row["URL Frequency"] == 1 else f"#{row['Project Number']}"),
                axis=1,
            )
    
            # Finalize columns
            df = df[
                [
                    "countries",
                    "fiscal_year_effective",
                    "finance_types",
                    "name",
                    "number",
                    "sectors",
                    "source",
                    "total_amount",
                    "total_amount_currency",
                    "url",
                ]
            ]
    
            return df
    
        except Exception as e:
>           raise RuntimeError(f"Failed to clean DFC projects. {e}") from None
E           RuntimeError: Failed to clean DFC projects. 'Fiscal_Year'

extract/workflows/banks/dfc.py:133: RuntimeError
=============================== warnings summary ===============================
../../.venv/lib/python3.11/site-packages/pyppeteer/us_keyboard_layout.py:73
  /app/.venv/lib/python3.11/site-packages/pyppeteer/us_keyboard_layout.py:73: DeprecationWarning: invalid escape sequence '\('
    'Digit9': {'keyCode': 57, 'code': 'Digit9', 'shiftKey': '\(', 'key': '9'},

../../.venv/lib/python3.11/site-packages/pyppeteer/us_keyboard_layout.py:143
  /app/.venv/lib/python3.11/site-packages/pyppeteer/us_keyboard_layout.py:143: DeprecationWarning: invalid escape sequence '\<'
    'Comma': {'keyCode': 188, 'code': 'Comma', 'shiftKey': '\<', 'key': ','},

../../.venv/lib/python3.11/site-packages/pyppeteer/us_keyboard_layout.py:247
  /app/.venv/lib/python3.11/site-packages/pyppeteer/us_keyboard_layout.py:247: DeprecationWarning: invalid escape sequence '\<'
    '<': {'keyCode': 188, 'key': '\<', 'code': 'Comma'},

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
=========================== short test summary info ============================
FAILED extract/tests/integration/banks.py::TestWorkflows::test_download[dfc] - RuntimeError: Failed to clean DFC projects. 'Fiscal_Year'
================= 1 failed, 3 deselected, 3 warnings in 3.58s ==================

After fix

Command output
warning: The `tool.uv.dev-dependencies` field (used in `/app/pyproject.toml`) is deprecated and will be removed in a future release; use `dependency-groups.dev` instead
Installed 14 packages in 3.90s
============================= test session starts ==============================
platform linux -- Python 3.11.0rc1, pytest-8.4.1, pluggy-1.6.0 -- /app/.venv/bin/python3
cachedir: .pytest_cache
django: version: 5.2.5, settings: config.settings (from env)
rootdir: /app
configfile: pyproject.toml
plugins: anyio-4.10.0, django-4.11.1
collecting ... collected 4 items / 3 deselected / 1 selected

extract/tests/integration/banks.py::TestWorkflows::test_download[dfc] PASSED [100%]

=============================== warnings summary ===============================
../../.venv/lib/python3.11/site-packages/pyppeteer/us_keyboard_layout.py:73
  /app/.venv/lib/python3.11/site-packages/pyppeteer/us_keyboard_layout.py:73: DeprecationWarning: invalid escape sequence '\('
    'Digit9': {'keyCode': 57, 'code': 'Digit9', 'shiftKey': '\(', 'key': '9'},

../../.venv/lib/python3.11/site-packages/pyppeteer/us_keyboard_layout.py:143
  /app/.venv/lib/python3.11/site-packages/pyppeteer/us_keyboard_layout.py:143: DeprecationWarning: invalid escape sequence '\<'
    'Comma': {'keyCode': 188, 'code': 'Comma', 'shiftKey': '\<', 'key': ','},

../../.venv/lib/python3.11/site-packages/pyppeteer/us_keyboard_layout.py:247
  /app/.venv/lib/python3.11/site-packages/pyppeteer/us_keyboard_layout.py:247: DeprecationWarning: invalid escape sequence '\<'
    '<': {'keyCode': 188, 'key': '\<', 'code': 'Comma'},

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
================= 1 passed, 3 deselected, 3 warnings in 3.51s ==================

@jpivarski

Proparco

Two tests:

docker run --rm -v "$PWD/services/extract/src:/app/src" --env-file "$PWD/services/extract/.env.dev" debit-scrapers bash -lc "cd /app/src/pipeline && uv run pytest ./extract/tests/integration/banks.py::TestWorkflows::test_result_page_scrape -k pro -vv"

and

docker run --rm -v "$PWD/services/extract/src:/app/src" --env-file "$PWD/services/extract/.env.dev" debit-scrapers bash -lc "cd /app/src/pipeline && uv run pytest ./extract/tests/integration/banks.py::TestWorkflows::test_project_page_scrape -k pro -vv"

The issue says that this one had a datetime format error that was already fixed. I must have been looking for the baseline code in the wrong place because I had to fix that format error myself.
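
A minimal sketch of the kind of change involved, assuming the project pages now render dates like "October 9, 2017" (the value shown in the traceback in this comment). `parse_project_date` and `DATE_FORMATS` are hypothetical names, not the code in this PR; only the `"%B %d %Y"` format comes from the existing `pro.py`.

```python
from datetime import datetime

# Hypothetical list of accepted formats. "%B %d %Y" is the format in the
# existing code; "%B %d, %Y" is an assumed addition for dates with a comma.
# Note that %B (full month name) depends on the active locale.
DATE_FORMATS = ("%B %d, %Y", "%B %d %Y")


def parse_project_date(raw_date: str) -> datetime:
    """Try each known format in turn and return the first successful parse."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw_date.strip(), fmt)
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {raw_date!r}")


# The string that fails in the traceback now parses.
parsed = parse_project_date("October 9, 2017")
print(parsed.date().isoformat())
```

Trying a short list of formats, rather than swapping one format string for another, keeps the scraper working if some pages still use the older style without a comma.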

Before fix

Command output
warning: The `tool.uv.dev-dependencies` field (used in `/app/pyproject.toml`) is deprecated and will be removed in a future release; use `dependency-groups.dev` instead
Installed 14 packages in 3.72s
============================= test session starts ==============================
platform linux -- Python 3.11.0rc1, pytest-8.4.1, pluggy-1.6.0 -- /app/.venv/bin/python3
cachedir: .pytest_cache
django: version: 5.2.5, settings: config.settings (from env)
rootdir: /app
configfile: pyproject.toml
plugins: anyio-4.10.0, django-4.11.1
collecting ... collected 12 items / 3 deselected / 9 selected

extract/tests/integration/banks.py::TestWorkflows::test_result_page_scrape[idb-https://www.iadb.org/en/project-search?page=0] PASSED [ 11%]
extract/tests/integration/banks.py::TestWorkflows::test_result_page_scrape[idb-https://www.iadb.org/en/project-search?page=400] PASSED [ 22%]
extract/tests/integration/banks.py::TestWorkflows::test_result_page_scrape[idb-https://www.iadb.org/en/project-search?page=997] PASSED [ 33%]
extract/tests/integration/banks.py::TestWorkflows::test_result_page_scrape[miga-https://www.miga.org/projects?page=0] PASSED [ 44%]
extract/tests/integration/banks.py::TestWorkflows::test_result_page_scrape[miga-https://www.miga.org/projects?page=68] PASSED [ 55%]
extract/tests/integration/banks.py::TestWorkflows::test_result_page_scrape[miga-https://www.miga.org/projects?page=118] PASSED [ 66%]
extract/tests/integration/banks.py::TestWorkflows::test_result_page_scrape[pro-https://www.proparco.fr/en/projects/list?page=0] PASSED [ 77%]
extract/tests/integration/banks.py::TestWorkflows::test_result_page_scrape[pro-https://www.proparco.fr/en/projects/list?page=25] PASSED [ 88%]
extract/tests/integration/banks.py::TestWorkflows::test_result_page_scrape[pro-https://www.proparco.fr/en/projects/list?page=39] PASSED [100%]

=============================== warnings summary ===============================
../../.venv/lib/python3.11/site-packages/pyppeteer/us_keyboard_layout.py:73
  /app/.venv/lib/python3.11/site-packages/pyppeteer/us_keyboard_layout.py:73: DeprecationWarning: invalid escape sequence '\('
    'Digit9': {'keyCode': 57, 'code': 'Digit9', 'shiftKey': '\(', 'key': '9'},

../../.venv/lib/python3.11/site-packages/pyppeteer/us_keyboard_layout.py:143
  /app/.venv/lib/python3.11/site-packages/pyppeteer/us_keyboard_layout.py:143: DeprecationWarning: invalid escape sequence '\<'
    'Comma': {'keyCode': 188, 'code': 'Comma', 'shiftKey': '\<', 'key': ','},

../../.venv/lib/python3.11/site-packages/pyppeteer/us_keyboard_layout.py:247
  /app/.venv/lib/python3.11/site-packages/pyppeteer/us_keyboard_layout.py:247: DeprecationWarning: invalid escape sequence '\<'
    '<': {'keyCode': 188, 'key': '\<', 'code': 'Comma'},

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
================= 9 passed, 3 deselected, 3 warnings in 51.55s =================

and

warning: The `tool.uv.dev-dependencies` field (used in `/app/pyproject.toml`) is deprecated and will be removed in a future release; use `dependency-groups.dev` instead
Installed 14 packages in 3.59s
============================= test session starts ==============================
platform linux -- Python 3.11.0rc1, pytest-8.4.1, pluggy-1.6.0 -- /app/.venv/bin/python3
cachedir: .pytest_cache
django: version: 5.2.5, settings: config.settings (from env)
rootdir: /app
configfile: pyproject.toml
plugins: anyio-4.10.0, django-4.11.1
collecting ... collected 27 items

extract/tests/integration/banks.py::TestWorkflows::test_project_page_scrape[adb-https://www.adb.org/iati/iati-activities-af.xml] PASSED [  3%]
extract/tests/integration/banks.py::TestWorkflows::test_project_page_scrape[adb-http://www.adb.org/iati/iati-activities-fj.xml] PASSED [  7%]
extract/tests/integration/banks.py::TestWorkflows::test_project_page_scrape[adb-http://www.adb.org/iati/iati-activities-id.xml] PASSED [ 11%]
extract/tests/integration/banks.py::TestWorkflows::test_project_page_scrape[adb-http://www.adb.org/iati/iati-activities-in.xml] PASSED [ 14%]
extract/tests/integration/banks.py::TestWorkflows::test_project_page_scrape[adb-https://www.adb.org/iati/iati-activities-kh.xml] PASSED [ 18%]
extract/tests/integration/banks.py::TestWorkflows::test_project_page_scrape[adb-http://www.adb.org/iati/iati-activities-mn.xml] PASSED [ 22%]
extract/tests/integration/banks.py::TestWorkflows::test_project_page_scrape[adb-http://www.adb.org/iati/iati-activities-ph.xml] PASSED [ 25%]
extract/tests/integration/banks.py::TestWorkflows::test_project_page_scrape[adb-https://www.adb.org/iati/iati-activities-reg.xml] PASSED [ 29%]
extract/tests/integration/banks.py::TestWorkflows::test_project_page_scrape[adb-http://www.adb.org/iati/iati-activities-uz.xml] PASSED [ 33%]
extract/tests/integration/banks.py::TestWorkflows::test_project_page_scrape[adb-http://www.adb.org/iati/iati-activities-vn.xml] PASSED [ 37%]
extract/tests/integration/banks.py::TestWorkflows::test_project_page_scrape[aiib-https://www.aiib.org/en/projects/details/2023/approved/Egypt-Sustainable-Transport-and-Digital-Infrastructure-Guarantee.html] PASSED [ 40%]
extract/tests/integration/banks.py::TestWorkflows::test_project_page_scrape[aiib-https://www.aiib.org/en/projects/details/2016/approved/Tajikistan-Dushanbe-Uzbekistan-Border-Road-Improvement.html] PASSED [ 44%]
extract/tests/integration/banks.py::TestWorkflows::test_project_page_scrape[aiib-https://www.aiib.org/en/projects/details/2023/proposed/Viet-Nam-Gia-Lai-Wind-Power-Project.html] PASSED [ 48%]
extract/tests/integration/banks.py::TestWorkflows::test_project_page_scrape[fmo-https://www.fmo.nl/project-detail/45033] PASSED [ 51%]
extract/tests/integration/banks.py::TestWorkflows::test_project_page_scrape[fmo-https://www.fmo.nl/project-detail/60377] PASSED [ 55%]
extract/tests/integration/banks.py::TestWorkflows::test_project_page_scrape[fmo-https://www.fmo.nl/project-detail/62828] PASSED [ 59%]
extract/tests/integration/banks.py::TestWorkflows::test_project_page_scrape[ifc-https://disclosuresservice.ifc.org/api/searchprovider/searchenterpriseprojects?payload=*&$start=0$srt=disclosed_date$order=desc$rows=100] PASSED [ 62%]
extract/tests/integration/banks.py::TestWorkflows::test_project_page_scrape[ifc-https://disclosuresservice.ifc.org/api/searchprovider/searchenterpriseprojects?payload=*&$start=200$srt=disclosed_date$order=desc$rows=100] PASSED [ 66%]
extract/tests/integration/banks.py::TestWorkflows::test_project_page_scrape[ifc-https://disclosuresservice.ifc.org/api/searchprovider/searchenterpriseprojects?payload=*&$start=8300$srt=disclosed_date$order=desc$rows=100] PASSED [ 70%]
extract/tests/integration/banks.py::TestWorkflows::test_project_page_scrape[ifc-https://disclosuresservice.ifc.org/api/searchprovider/searchenterpriseprojects?payload=*&$start=11800$srt=disclosed_date$order=desc$rows=100] PASSED [ 74%]
extract/tests/integration/banks.py::TestWorkflows::test_project_page_scrape[ifc-https://disclosuresservice.ifc.org/api/searchprovider/searchenterpriseprojects?payload=*&$start=11900$srt=disclosed_date$order=desc$rows=100] PASSED [ 77%]
extract/tests/integration/banks.py::TestWorkflows::test_project_page_scrape[miga-https://www.miga.org/project/bboxx-rwanda-kenya-and-drc-0] PASSED [ 81%]
extract/tests/integration/banks.py::TestWorkflows::test_project_page_scrape[miga-https://www.miga.org/project/dedicated-freight-corridor-corporation-india-limited-1] PASSED [ 85%]
extract/tests/integration/banks.py::TestWorkflows::test_project_page_scrape[miga-https://www.miga.org/project/koridori-srbije-ltd-morava-motorway-0] PASSED [ 88%]
extract/tests/integration/banks.py::TestWorkflows::test_project_page_scrape[pro-https://www.proparco.fr/en/carte-des-projets/ecobank-trade-finance] FAILED [ 92%]
extract/tests/integration/banks.py::TestWorkflows::test_project_page_scrape[pro-https://www.proparco.fr/en/carte-des-projets/darby-latam-pf-iii] FAILED [ 96%]
extract/tests/integration/banks.py::TestWorkflows::test_project_page_scrape[pro-https://www.proparco.fr/en/carte-des-projets/mpef-iv] FAILED [100%]

=================================== FAILURES ===================================
_ TestWorkflows.test_project_page_scrape[pro-https://www.proparco.fr/en/carte-des-projets/ecobank-trade-finance] _

self = <pipeline.extract.tests.integration.banks.TestWorkflows object at 0x79ca7a8d0790>
bank = 'pro'
project_page_url = 'https://www.proparco.fr/en/carte-des-projets/ecobank-trade-finance'
data_request_client = <common.http.DataRequestClient object at 0x79ca7a035a50>

    def test_project_page_scrape(
        self,
        bank: str,
        project_page_url: str,
        data_request_client: DataRequestClient,
    ) -> None:
        """Test of the `ProjectScrapeWorkflow`.
    
        Asserts that scraping a project page for data
        does not result in an exception. Sleeps for three
        seconds in between HTTP calls to avoid potential
        throttling by the bank website.
    
        Args:
            bank: The abbreviation for the bank or
                financial institution (e.g., "AFDB").
    
            project_page_url: A URL to a project
                page on the bank's website.
    
            data_request_client: An instance of a client used
                to make HTTP requests while rotating headers.
    
        Returns:
            `None`
        """
        workflow: ProjectScrapeWorkflow = WorkflowClassRegistry.get(
            source=bank,
            workflow_type=settings.PROJECT_PAGE_WORKFLOW,
            data_request_client=data_request_client,
            msg_queue_client=None,
            db_client=None,
        )
        time.sleep(3)
>       project = workflow.scrape_project_page(project_page_url)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

extract/tests/integration/banks.py:262: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
extract/workflows/banks/pro.py:189: in scrape_project_page
    parsed_date = datetime.strptime(raw_date, "%B %d %Y")
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
/usr/lib/python3.11/_strptime.py:568: in _strptime_datetime
    tt, fraction, gmtoff_fraction = _strptime(data_string, format)
                                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

data_string = 'October 9, 2017', format = '%B %d %Y'

    def _strptime(data_string, format="%a %b %d %H:%M:%S %Y"):
        """Return a 2-tuple consisting of a time struct and an int containing
        the number of microseconds based on the input string and the
        format string."""
    
        for index, arg in enumerate([data_string, format]):
            if not isinstance(arg, str):
                msg = "strptime() argument {} must be str, not {}"
                raise TypeError(msg.format(index, type(arg)))
    
        global _TimeRE_cache, _regex_cache
        with _cache_lock:
            locale_time = _TimeRE_cache.locale_time
            if (_getlang() != locale_time.lang or
                time.tzname != locale_time.tzname or
                time.daylight != locale_time.daylight):
                _TimeRE_cache = TimeRE()
                _regex_cache.clear()
                locale_time = _TimeRE_cache.locale_time
            if len(_regex_cache) > _CACHE_MAX_SIZE:
                _regex_cache.clear()
            format_regex = _regex_cache.get(format)
            if not format_regex:
                try:
                    format_regex = _TimeRE_cache.compile(format)
                # KeyError raised when a bad format is found; can be specified as
                # \\, in which case it was a stray % but with a space after it
                except KeyError as err:
                    bad_directive = err.args[0]
                    if bad_directive == "\\":
                        bad_directive = "%"
                    del err
                    raise ValueError("'%s' is a bad directive in format '%s'" %
                                        (bad_directive, format)) from None
                # IndexError only occurs when the format string is "%"
                except IndexError:
                    raise ValueError("stray %% in format '%s'" % format) from None
                _regex_cache[format] = format_regex
        found = format_regex.match(data_string)
        if not found:
>           raise ValueError("time data %r does not match format %r" %
                             (data_string, format))
E           ValueError: time data 'October 9, 2017' does not match format '%B %d %Y'

/usr/lib/python3.11/_strptime.py:349: ValueError
_ TestWorkflows.test_project_page_scrape[pro-https://www.proparco.fr/en/carte-des-projets/darby-latam-pf-iii] _

self = <pipeline.extract.tests.integration.banks.TestWorkflows object at 0x79ca7a8d0510>
bank = 'pro'
project_page_url = 'https://www.proparco.fr/en/carte-des-projets/darby-latam-pf-iii'
data_request_client = <common.http.DataRequestClient object at 0x79ca7a383110>

    def test_project_page_scrape(
        self,
        bank: str,
        project_page_url: str,
        data_request_client: DataRequestClient,
    ) -> None:
        """Test of the `ProjectScrapeWorkflow`.
    
        Asserts that scraping a project page for data
        does not result in an exception. Sleeps for three
        seconds in between HTTP calls to avoid potential
        throttling by the bank website.
    
        Args:
            bank: The abbreviation for the bank or
                financial institution (e.g., "AFDB").
    
            project_page_url: A URL to a project
                page on the bank's website.
    
            data_request_client: An instance of a client used
                to make HTTP requests while rotating headers.
    
        Returns:
            `None`
        """
        workflow: ProjectScrapeWorkflow = WorkflowClassRegistry.get(
            source=bank,
            workflow_type=settings.PROJECT_PAGE_WORKFLOW,
            data_request_client=data_request_client,
            msg_queue_client=None,
            db_client=None,
        )
        time.sleep(3)
>       project = workflow.scrape_project_page(project_page_url)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

extract/tests/integration/banks.py:262: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
extract/workflows/banks/pro.py:189: in scrape_project_page
    parsed_date = datetime.strptime(raw_date, "%B %d %Y")
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
/usr/lib/python3.11/_strptime.py:568: in _strptime_datetime
    tt, fraction, gmtoff_fraction = _strptime(data_string, format)
                                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

data_string = 'December 15, 2017', format = '%B %d %Y'

    def _strptime(data_string, format="%a %b %d %H:%M:%S %Y"):
        """Return a 2-tuple consisting of a time struct and an int containing
        the number of microseconds based on the input string and the
        format string."""
    
        for index, arg in enumerate([data_string, format]):
            if not isinstance(arg, str):
                msg = "strptime() argument {} must be str, not {}"
                raise TypeError(msg.format(index, type(arg)))
    
        global _TimeRE_cache, _regex_cache
        with _cache_lock:
            locale_time = _TimeRE_cache.locale_time
            if (_getlang() != locale_time.lang or
                time.tzname != locale_time.tzname or
                time.daylight != locale_time.daylight):
                _TimeRE_cache = TimeRE()
                _regex_cache.clear()
                locale_time = _TimeRE_cache.locale_time
            if len(_regex_cache) > _CACHE_MAX_SIZE:
                _regex_cache.clear()
            format_regex = _regex_cache.get(format)
            if not format_regex:
                try:
                    format_regex = _TimeRE_cache.compile(format)
                # KeyError raised when a bad format is found; can be specified as
                # \\, in which case it was a stray % but with a space after it
                except KeyError as err:
                    bad_directive = err.args[0]
                    if bad_directive == "\\":
                        bad_directive = "%"
                    del err
                    raise ValueError("'%s' is a bad directive in format '%s'" %
                                        (bad_directive, format)) from None
                # IndexError only occurs when the format string is "%"
                except IndexError:
                    raise ValueError("stray %% in format '%s'" % format) from None
                _regex_cache[format] = format_regex
        found = format_regex.match(data_string)
        if not found:
>           raise ValueError("time data %r does not match format %r" %
                             (data_string, format))
E           ValueError: time data 'December 15, 2017' does not match format '%B %d %Y'

/usr/lib/python3.11/_strptime.py:349: ValueError
_ TestWorkflows.test_project_page_scrape[pro-https://www.proparco.fr/en/carte-des-projets/mpef-iv] _

self = <pipeline.extract.tests.integration.banks.TestWorkflows object at 0x79ca7a8d0090>
bank = 'pro'
project_page_url = 'https://www.proparco.fr/en/carte-des-projets/mpef-iv'
data_request_client = <common.http.DataRequestClient object at 0x79ca7a6887d0>

    def test_project_page_scrape(
        self,
        bank: str,
        project_page_url: str,
        data_request_client: DataRequestClient,
    ) -> None:
        """Test of the `ProjectScrapeWorkflow`.
    
        Asserts that scraping a project page for data
        does not result in an exception. Sleeps for three
        seconds in between HTTP calls to avoid potential
        throttling by the bank website.
    
        Args:
            bank: The abbreviation for the bank or
                financial institution (e.g., "AFDB").
    
            project_page_url: A URL to a project
                page on the bank's website.
    
            data_request_client: An instance of a client used
                to make HTTP requests while rotating headers.
    
        Returns:
            `None`
        """
        workflow: ProjectScrapeWorkflow = WorkflowClassRegistry.get(
            source=bank,
            workflow_type=settings.PROJECT_PAGE_WORKFLOW,
            data_request_client=data_request_client,
            msg_queue_client=None,
            db_client=None,
        )
        time.sleep(3)
>       project = workflow.scrape_project_page(project_page_url)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

extract/tests/integration/banks.py:262: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
extract/workflows/banks/pro.py:189: in scrape_project_page
    parsed_date = datetime.strptime(raw_date, "%B %d %Y")
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
/usr/lib/python3.11/_strptime.py:568: in _strptime_datetime
    tt, fraction, gmtoff_fraction = _strptime(data_string, format)
                                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

data_string = 'February 28, 2018', format = '%B %d %Y'

    def _strptime(data_string, format="%a %b %d %H:%M:%S %Y"):
        """Return a 2-tuple consisting of a time struct and an int containing
        the number of microseconds based on the input string and the
        format string."""
    
        for index, arg in enumerate([data_string, format]):
            if not isinstance(arg, str):
                msg = "strptime() argument {} must be str, not {}"
                raise TypeError(msg.format(index, type(arg)))
    
        global _TimeRE_cache, _regex_cache
        with _cache_lock:
            locale_time = _TimeRE_cache.locale_time
            if (_getlang() != locale_time.lang or
                time.tzname != locale_time.tzname or
                time.daylight != locale_time.daylight):
                _TimeRE_cache = TimeRE()
                _regex_cache.clear()
                locale_time = _TimeRE_cache.locale_time
            if len(_regex_cache) > _CACHE_MAX_SIZE:
                _regex_cache.clear()
            format_regex = _regex_cache.get(format)
            if not format_regex:
                try:
                    format_regex = _TimeRE_cache.compile(format)
                # KeyError raised when a bad format is found; can be specified as
                # \\, in which case it was a stray % but with a space after it
                except KeyError as err:
                    bad_directive = err.args[0]
                    if bad_directive == "\\":
                        bad_directive = "%"
                    del err
                    raise ValueError("'%s' is a bad directive in format '%s'" %
                                        (bad_directive, format)) from None
                # IndexError only occurs when the format string is "%"
                except IndexError:
                    raise ValueError("stray %% in format '%s'" % format) from None
                _regex_cache[format] = format_regex
        found = format_regex.match(data_string)
        if not found:
>           raise ValueError("time data %r does not match format %r" %
                             (data_string, format))
E           ValueError: time data 'February 28, 2018' does not match format '%B %d %Y'

/usr/lib/python3.11/_strptime.py:349: ValueError
=============================== warnings summary ===============================
../../.venv/lib/python3.11/site-packages/pyppeteer/us_keyboard_layout.py:73
  /app/.venv/lib/python3.11/site-packages/pyppeteer/us_keyboard_layout.py:73: DeprecationWarning: invalid escape sequence '\('
    'Digit9': {'keyCode': 57, 'code': 'Digit9', 'shiftKey': '\(', 'key': '9'},

../../.venv/lib/python3.11/site-packages/pyppeteer/us_keyboard_layout.py:143
  /app/.venv/lib/python3.11/site-packages/pyppeteer/us_keyboard_layout.py:143: DeprecationWarning: invalid escape sequence '\<'
    'Comma': {'keyCode': 188, 'code': 'Comma', 'shiftKey': '\<', 'key': ','},

../../.venv/lib/python3.11/site-packages/pyppeteer/us_keyboard_layout.py:247
  /app/.venv/lib/python3.11/site-packages/pyppeteer/us_keyboard_layout.py:247: DeprecationWarning: invalid escape sequence '\<'
    '<': {'keyCode': 188, 'key': '\<', 'code': 'Comma'},

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
=========================== short test summary info ============================
FAILED extract/tests/integration/banks.py::TestWorkflows::test_project_page_scrape[pro-https://www.proparco.fr/en/carte-des-projets/ecobank-trade-finance] - ValueError: time data 'October 9, 2017' does not match format '%B %d %Y'
FAILED extract/tests/integration/banks.py::TestWorkflows::test_project_page_scrape[pro-https://www.proparco.fr/en/carte-des-projets/darby-latam-pf-iii] - ValueError: time data 'December 15, 2017' does not match format '%B %d %Y'
FAILED extract/tests/integration/banks.py::TestWorkflows::test_project_page_scrape[pro-https://www.proparco.fr/en/carte-des-projets/mpef-iv] - ValueError: time data 'February 28, 2018' does not match format '%B %d %Y'
============= 3 failed, 24 passed, 3 warnings in 129.42s (0:02:09) =============
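
All three failures above share one cause: the scraped date strings contain a comma after the day (`'October 9, 2017'`), which the format `'%B %d %Y'` used at `pro.py:189` does not accept. Below is a minimal sketch of a more tolerant parser. The helper name `parse_project_date` and the list of fallback formats are hypothetical, for illustration only, and not necessarily what this PR changes in `pro.py`. It also assumes an English locale, since `%B` matches locale-dependent month names.

```python
from datetime import datetime

# Hypothetical list of formats to try, most likely first.
# The comma form matches the strings seen in the failing tests;
# the comma-less form is the one the scraper used before.
_DATE_FORMATS = ("%B %d, %Y", "%B %d %Y")


def parse_project_date(raw_date: str) -> datetime:
    """Parse a scraped date string, trying each known format in turn.

    Hypothetical helper for illustration; not taken from pro.py.
    """
    # Collapse repeated whitespace and trim the ends before parsing.
    cleaned = " ".join(raw_date.split())
    for fmt in _DATE_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt)
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {raw_date!r}")


if __name__ == "__main__":
    # The three strings that failed in the log above.
    for text in ("October 9, 2017", "December 15, 2017", "February 28, 2018"):
        print(parse_project_date(text).date())
```

Trying several formats keeps the scraper working if the site switches between the two styles again. A string that matches neither format still raises `ValueError`, so a new format change shows up as a test failure rather than passing silently.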

After fix

Command output
warning: The `tool.uv.dev-dependencies` field (used in `/app/pyproject.toml`) is deprecated and will be removed in a future release; use `dependency-groups.dev` instead
Installed 14 packages in 3.85s
============================= test session starts ==============================
platform linux -- Python 3.11.0rc1, pytest-8.4.1, pluggy-1.6.0 -- /app/.venv/bin/python3
cachedir: .pytest_cache
django: version: 5.2.5, settings: config.settings (from env)
rootdir: /app
configfile: pyproject.toml
plugins: anyio-4.10.0, django-4.11.1
collecting ... collected 12 items / 3 deselected / 9 selected

extract/tests/integration/banks.py::TestWorkflows::test_result_page_scrape[idb-https://www.iadb.org/en/project-search?page=0] PASSED [ 11%]
extract/tests/integration/banks.py::TestWorkflows::test_result_page_scrape[idb-https://www.iadb.org/en/project-search?page=400] PASSED [ 22%]
extract/tests/integration/banks.py::TestWorkflows::test_result_page_scrape[idb-https://www.iadb.org/en/project-search?page=997] PASSED [ 33%]
extract/tests/integration/banks.py::TestWorkflows::test_result_page_scrape[miga-https://www.miga.org/projects?page=0] PASSED [ 44%]
extract/tests/integration/banks.py::TestWorkflows::test_result_page_scrape[miga-https://www.miga.org/projects?page=68] PASSED [ 55%]
extract/tests/integration/banks.py::TestWorkflows::test_result_page_scrape[miga-https://www.miga.org/projects?page=118] PASSED [ 66%]
extract/tests/integration/banks.py::TestWorkflows::test_result_page_scrape[pro-https://www.proparco.fr/en/projects/list?page=0] PASSED [ 77%]
extract/tests/integration/banks.py::TestWorkflows::test_result_page_scrape[pro-https://www.proparco.fr/en/projects/list?page=25] PASSED [ 88%]
extract/tests/integration/banks.py::TestWorkflows::test_result_page_scrape[pro-https://www.proparco.fr/en/projects/list?page=39] PASSED [100%]

=============================== warnings summary ===============================
../../.venv/lib/python3.11/site-packages/pyppeteer/us_keyboard_layout.py:73
  /app/.venv/lib/python3.11/site-packages/pyppeteer/us_keyboard_layout.py:73: DeprecationWarning: invalid escape sequence '\('
    'Digit9': {'keyCode': 57, 'code': 'Digit9', 'shiftKey': '\(', 'key': '9'},

../../.venv/lib/python3.11/site-packages/pyppeteer/us_keyboard_layout.py:143
  /app/.venv/lib/python3.11/site-packages/pyppeteer/us_keyboard_layout.py:143: DeprecationWarning: invalid escape sequence '\<'
    'Comma': {'keyCode': 188, 'code': 'Comma', 'shiftKey': '\<', 'key': ','},

../../.venv/lib/python3.11/site-packages/pyppeteer/us_keyboard_layout.py:247
  /app/.venv/lib/python3.11/site-packages/pyppeteer/us_keyboard_layout.py:247: DeprecationWarning: invalid escape sequence '\<'
    '<': {'keyCode': 188, 'key': '\<', 'code': 'Comma'},

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
================= 9 passed, 3 deselected, 3 warnings in 46.19s =================

and the full `test_project_page_scrape` run:

warning: The `tool.uv.dev-dependencies` field (used in `/app/pyproject.toml`) is deprecated and will be removed in a future release; use `dependency-groups.dev` instead
Installed 14 packages in 3.58s
============================= test session starts ==============================
platform linux -- Python 3.11.0rc1, pytest-8.4.1, pluggy-1.6.0 -- /app/.venv/bin/python3
cachedir: .pytest_cache
django: version: 5.2.5, settings: config.settings (from env)
rootdir: /app
configfile: pyproject.toml
plugins: anyio-4.10.0, django-4.11.1
collecting ... collected 27 items

extract/tests/integration/banks.py::TestWorkflows::test_project_page_scrape[adb-https://www.adb.org/iati/iati-activities-af.xml] PASSED [  3%]
extract/tests/integration/banks.py::TestWorkflows::test_project_page_scrape[adb-http://www.adb.org/iati/iati-activities-fj.xml] PASSED [  7%]
extract/tests/integration/banks.py::TestWorkflows::test_project_page_scrape[adb-http://www.adb.org/iati/iati-activities-id.xml] PASSED [ 11%]
extract/tests/integration/banks.py::TestWorkflows::test_project_page_scrape[adb-http://www.adb.org/iati/iati-activities-in.xml] PASSED [ 14%]
extract/tests/integration/banks.py::TestWorkflows::test_project_page_scrape[adb-https://www.adb.org/iati/iati-activities-kh.xml] PASSED [ 18%]
extract/tests/integration/banks.py::TestWorkflows::test_project_page_scrape[adb-http://www.adb.org/iati/iati-activities-mn.xml] PASSED [ 22%]
extract/tests/integration/banks.py::TestWorkflows::test_project_page_scrape[adb-http://www.adb.org/iati/iati-activities-ph.xml] PASSED [ 25%]
extract/tests/integration/banks.py::TestWorkflows::test_project_page_scrape[adb-https://www.adb.org/iati/iati-activities-reg.xml] PASSED [ 29%]
extract/tests/integration/banks.py::TestWorkflows::test_project_page_scrape[adb-http://www.adb.org/iati/iati-activities-uz.xml] PASSED [ 33%]
extract/tests/integration/banks.py::TestWorkflows::test_project_page_scrape[adb-http://www.adb.org/iati/iati-activities-vn.xml] PASSED [ 37%]
extract/tests/integration/banks.py::TestWorkflows::test_project_page_scrape[aiib-https://www.aiib.org/en/projects/details/2023/approved/Egypt-Sustainable-Transport-and-Digital-Infrastructure-Guarantee.html] PASSED [ 40%]
extract/tests/integration/banks.py::TestWorkflows::test_project_page_scrape[aiib-https://www.aiib.org/en/projects/details/2016/approved/Tajikistan-Dushanbe-Uzbekistan-Border-Road-Improvement.html] PASSED [ 44%]
extract/tests/integration/banks.py::TestWorkflows::test_project_page_scrape[aiib-https://www.aiib.org/en/projects/details/2023/proposed/Viet-Nam-Gia-Lai-Wind-Power-Project.html] PASSED [ 48%]
extract/tests/integration/banks.py::TestWorkflows::test_project_page_scrape[fmo-https://www.fmo.nl/project-detail/45033] PASSED [ 51%]
extract/tests/integration/banks.py::TestWorkflows::test_project_page_scrape[fmo-https://www.fmo.nl/project-detail/60377] PASSED [ 55%]
extract/tests/integration/banks.py::TestWorkflows::test_project_page_scrape[fmo-https://www.fmo.nl/project-detail/62828] PASSED [ 59%]
extract/tests/integration/banks.py::TestWorkflows::test_project_page_scrape[ifc-https://disclosuresservice.ifc.org/api/searchprovider/searchenterpriseprojects?payload=*&$start=0$srt=disclosed_date$order=desc$rows=100] PASSED [ 62%]
extract/tests/integration/banks.py::TestWorkflows::test_project_page_scrape[ifc-https://disclosuresservice.ifc.org/api/searchprovider/searchenterpriseprojects?payload=*&$start=200$srt=disclosed_date$order=desc$rows=100] PASSED [ 66%]
extract/tests/integration/banks.py::TestWorkflows::test_project_page_scrape[ifc-https://disclosuresservice.ifc.org/api/searchprovider/searchenterpriseprojects?payload=*&$start=8300$srt=disclosed_date$order=desc$rows=100] PASSED [ 70%]
extract/tests/integration/banks.py::TestWorkflows::test_project_page_scrape[ifc-https://disclosuresservice.ifc.org/api/searchprovider/searchenterpriseprojects?payload=*&$start=11800$srt=disclosed_date$order=desc$rows=100] PASSED [ 74%]
extract/tests/integration/banks.py::TestWorkflows::test_project_page_scrape[ifc-https://disclosuresservice.ifc.org/api/searchprovider/searchenterpriseprojects?payload=*&$start=11900$srt=disclosed_date$order=desc$rows=100] PASSED [ 77%]
extract/tests/integration/banks.py::TestWorkflows::test_project_page_scrape[miga-https://www.miga.org/project/bboxx-rwanda-kenya-and-drc-0] PASSED [ 81%]
extract/tests/integration/banks.py::TestWorkflows::test_project_page_scrape[miga-https://www.miga.org/project/dedicated-freight-corridor-corporation-india-limited-1] PASSED [ 85%]
extract/tests/integration/banks.py::TestWorkflows::test_project_page_scrape[miga-https://www.miga.org/project/koridori-srbije-ltd-morava-motorway-0] PASSED [ 88%]
extract/tests/integration/banks.py::TestWorkflows::test_project_page_scrape[pro-https://www.proparco.fr/en/carte-des-projets/ecobank-trade-finance] PASSED [ 92%]
extract/tests/integration/banks.py::TestWorkflows::test_project_page_scrape[pro-https://www.proparco.fr/en/carte-des-projets/darby-latam-pf-iii] PASSED [ 96%]
extract/tests/integration/banks.py::TestWorkflows::test_project_page_scrape[pro-https://www.proparco.fr/en/carte-des-projets/mpef-iv] PASSED [100%]

=============================== warnings summary ===============================
../../.venv/lib/python3.11/site-packages/pyppeteer/us_keyboard_layout.py:73
  /app/.venv/lib/python3.11/site-packages/pyppeteer/us_keyboard_layout.py:73: DeprecationWarning: invalid escape sequence '\('
    'Digit9': {'keyCode': 57, 'code': 'Digit9', 'shiftKey': '\(', 'key': '9'},

../../.venv/lib/python3.11/site-packages/pyppeteer/us_keyboard_layout.py:143
  /app/.venv/lib/python3.11/site-packages/pyppeteer/us_keyboard_layout.py:143: DeprecationWarning: invalid escape sequence '\<'
    'Comma': {'keyCode': 188, 'code': 'Comma', 'shiftKey': '\<', 'key': ','},

../../.venv/lib/python3.11/site-packages/pyppeteer/us_keyboard_layout.py:247
  /app/.venv/lib/python3.11/site-packages/pyppeteer/us_keyboard_layout.py:247: DeprecationWarning: invalid escape sequence '\<'
    '<': {'keyCode': 188, 'key': '\<', 'code': 'Comma'},

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
================== 27 passed, 3 warnings in 130.71s (0:02:10) ==================

@jpivarski

Testing everything

docker run --rm -v "$PWD/services/extract/src:/app/src" --env-file "$PWD/services/extract/.env.dev" debit-scrapers bash -lc "cd /app/src/pipeline && uv run pytest ./extract/tests/integration/banks.py::TestWorkflows -vv"

Before fix

Command output
warning: The `tool.uv.dev-dependencies` field (used in `/app/pyproject.toml`) is deprecated and will be removed in a future release; use `dependency-groups.dev` instead
Installed 14 packages in 3.72s
============================= test session starts ==============================
platform linux -- Python 3.11.0rc1, pytest-8.4.1, pluggy-1.6.0 -- /app/.venv/bin/python3
cachedir: .pytest_cache
django: version: 5.2.5, settings: config.settings (from env)
rootdir: /app
configfile: pyproject.toml
plugins: anyio-4.10.0, django-4.11.1
collecting ... collected 81 items

extract/tests/integration/banks.py::TestWorkflows::test_download[deg] PASSED [  1%]
extract/tests/integration/banks.py::TestWorkflows::test_download[dfc] PASSED [  2%]
extract/tests/integration/banks.py::TestWorkflows::test_download[kfw] PASSED [  3%]
extract/tests/integration/banks.py::TestWorkflows::test_download[wb] PASSED [  4%]
extract/tests/integration/banks.py::TestWorkflows::test_partial_download[afdb] PASSED [  6%]
extract/tests/integration/banks.py::TestWorkflows::test_partial_download[ebrd] FAILED [  7%]
extract/tests/integration/banks.py::TestWorkflows::test_partial_download[idb] PASSED [  8%]
extract/tests/integration/banks.py::TestWorkflows::test_generate_seed_urls[adb] PASSED [  9%]
extract/tests/integration/banks.py::TestWorkflows::test_generate_seed_urls[aiib] PASSED [ 11%]
extract/tests/integration/banks.py::TestWorkflows::test_generate_seed_urls[bio] PASSED [ 12%]
extract/tests/integration/banks.py::TestWorkflows::test_generate_seed_urls[eib] PASSED [ 13%]
extract/tests/integration/banks.py::TestWorkflows::test_generate_seed_urls[fmo] PASSED [ 14%]
extract/tests/integration/banks.py::TestWorkflows::test_generate_seed_urls[ifc] PASSED [ 16%]
extract/tests/integration/banks.py::TestWorkflows::test_generate_seed_urls[miga] PASSED [ 17%]
extract/tests/integration/banks.py::TestWorkflows::test_generate_seed_urls[nbim] PASSED [ 18%]
extract/tests/integration/banks.py::TestWorkflows::test_generate_seed_urls[pro] PASSED [ 19%]
extract/tests/integration/banks.py::TestWorkflows::test_generate_seed_urls[undp] PASSED [ 20%]
extract/tests/integration/banks.py::TestWorkflows::test_project_page_scrape[adb-https://www.adb.org/iati/iati-activities-af.xml] PASSED [ 22%]
extract/tests/integration/banks.py::TestWorkflows::test_project_page_scrape[adb-http://www.adb.org/iati/iati-activities-fj.xml] PASSED [ 23%]
extract/tests/integration/banks.py::TestWorkflows::test_project_page_scrape[adb-http://www.adb.org/iati/iati-activities-id.xml] PASSED [ 24%]
extract/tests/integration/banks.py::TestWorkflows::test_project_page_scrape[adb-http://www.adb.org/iati/iati-activities-in.xml] PASSED [ 25%]
extract/tests/integration/banks.py::TestWorkflows::test_project_page_scrape[adb-https://www.adb.org/iati/iati-activities-kh.xml] PASSED [ 27%]
extract/tests/integration/banks.py::TestWorkflows::test_project_page_scrape[adb-http://www.adb.org/iati/iati-activities-mn.xml] PASSED [ 28%]
extract/tests/integration/banks.py::TestWorkflows::test_project_page_scrape[adb-http://www.adb.org/iati/iati-activities-ph.xml] PASSED [ 29%]
extract/tests/integration/banks.py::TestWorkflows::test_project_page_scrape[adb-https://www.adb.org/iati/iati-activities-reg.xml] PASSED [ 30%]
extract/tests/integration/banks.py::TestWorkflows::test_project_page_scrape[adb-http://www.adb.org/iati/iati-activities-uz.xml] PASSED [ 32%]
extract/tests/integration/banks.py::TestWorkflows::test_project_page_scrape[adb-http://www.adb.org/iati/iati-activities-vn.xml] PASSED [ 33%]
extract/tests/integration/banks.py::TestWorkflows::test_project_page_scrape[aiib-https://www.aiib.org/en/projects/details/2023/approved/Egypt-Sustainable-Transport-and-Digital-Infrastructure-Guarantee.html] PASSED [ 34%]
extract/tests/integration/banks.py::TestWorkflows::test_project_page_scrape[aiib-https://www.aiib.org/en/projects/details/2016/approved/Tajikistan-Dushanbe-Uzbekistan-Border-Road-Improvement.html] PASSED [ 35%]
extract/tests/integration/banks.py::TestWorkflows::test_project_page_scrape[aiib-https://www.aiib.org/en/projects/details/2023/proposed/Viet-Nam-Gia-Lai-Wind-Power-Project.html] PASSED [ 37%]
extract/tests/integration/banks.py::TestWorkflows::test_project_page_scrape[fmo-https://www.fmo.nl/project-detail/45033] PASSED [ 38%]
extract/tests/integration/banks.py::TestWorkflows::test_project_page_scrape[fmo-https://www.fmo.nl/project-detail/60377] PASSED [ 39%]
extract/tests/integration/banks.py::TestWorkflows::test_project_page_scrape[fmo-https://www.fmo.nl/project-detail/62828] PASSED [ 40%]
extract/tests/integration/banks.py::TestWorkflows::test_project_page_scrape[ifc-https://disclosuresservice.ifc.org/api/searchprovider/searchenterpriseprojects?payload=*&$start=0$srt=disclosed_date$order=desc$rows=100] PASSED [ 41%]
extract/tests/integration/banks.py::TestWorkflows::test_project_page_scrape[ifc-https://disclosuresservice.ifc.org/api/searchprovider/searchenterpriseprojects?payload=*&$start=200$srt=disclosed_date$order=desc$rows=100] PASSED [ 43%]
extract/tests/integration/banks.py::TestWorkflows::test_project_page_scrape[ifc-https://disclosuresservice.ifc.org/api/searchprovider/searchenterpriseprojects?payload=*&$start=8300$srt=disclosed_date$order=desc$rows=100] PASSED [ 44%]
extract/tests/integration/banks.py::TestWorkflows::test_project_page_scrape[ifc-https://disclosuresservice.ifc.org/api/searchprovider/searchenterpriseprojects?payload=*&$start=11800$srt=disclosed_date$order=desc$rows=100] PASSED [ 45%]
extract/tests/integration/banks.py::TestWorkflows::test_project_page_scrape[ifc-https://disclosuresservice.ifc.org/api/searchprovider/searchenterpriseprojects?payload=*&$start=11900$srt=disclosed_date$order=desc$rows=100] PASSED [ 46%]
extract/tests/integration/banks.py::TestWorkflows::test_project_page_scrape[miga-https://www.miga.org/project/bboxx-rwanda-kenya-and-drc-0] PASSED [ 48%]
extract/tests/integration/banks.py::TestWorkflows::test_project_page_scrape[miga-https://www.miga.org/project/dedicated-freight-corridor-corporation-india-limited-1] PASSED [ 49%]
extract/tests/integration/banks.py::TestWorkflows::test_project_page_scrape[miga-https://www.miga.org/project/koridori-srbije-ltd-morava-motorway-0] PASSED [ 50%]
extract/tests/integration/banks.py::TestWorkflows::test_project_page_scrape[pro-https://www.proparco.fr/en/carte-des-projets/ecobank-trade-finance] PASSED [ 51%]
extract/tests/integration/banks.py::TestWorkflows::test_project_page_scrape[pro-https://www.proparco.fr/en/carte-des-projets/darby-latam-pf-iii] PASSED [ 53%]
extract/tests/integration/banks.py::TestWorkflows::test_project_page_scrape[pro-https://www.proparco.fr/en/carte-des-projets/mpef-iv] PASSED [ 54%]
extract/tests/integration/banks.py::TestWorkflows::test_project_partial_page_scrape[afdb-https://mapafrica.afdb.org/api/v13/activities/46002-P-MZ-AA0-045/organisations] PASSED [ 55%]
extract/tests/integration/banks.py::TestWorkflows::test_project_partial_page_scrape[bio-https://www.bio-invest.be/en/investments/banco-guayaquil-1] PASSED [ 56%]
extract/tests/integration/banks.py::TestWorkflows::test_project_partial_page_scrape[bio-https://www.bio-invest.be/en/investments/cofina-mali] PASSED [ 58%]
extract/tests/integration/banks.py::TestWorkflows::test_project_partial_page_scrape[bio-https://www.bio-invest.be/en/investments/zoscales-fund-i] PASSED [ 59%]
extract/tests/integration/banks.py::TestWorkflows::test_project_partial_page_scrape[ebrd-https://www.ebrd.com/home/work-with-us/projects/psd/36582.html] FAILED [ 60%]
extract/tests/integration/banks.py::TestWorkflows::test_project_partial_page_scrape[ebrd-https://www.ebrd.com/home/work-with-us/projects/psd/56092.html] FAILED [ 61%]
extract/tests/integration/banks.py::TestWorkflows::test_project_partial_page_scrape[ebrd-https://www.ebrd.com/work-with-us/projects/psd/52642.html] FAILED [ 62%]
extract/tests/integration/banks.py::TestWorkflows::test_project_partial_page_scrape[ebrd-https://www.ebrd.com/work-with-us/projects/psd/54846.html] FAILED [ 64%]
extract/tests/integration/banks.py::TestWorkflows::test_project_partial_page_scrape[ebrd-https://www.ebrd.com/work-with-us/projects/psd/technonicol-regional-expansion--resource-efficiency.html] FAILED [ 65%]
extract/tests/integration/banks.py::TestWorkflows::test_project_partial_page_scrape[eib-https://www.eib.org/en/projects/all/20190714] PASSED [ 66%]
extract/tests/integration/banks.py::TestWorkflows::test_project_partial_page_scrape[idb-https://www.iadb.org/en/project/BO0060] PASSED [ 67%]
extract/tests/integration/banks.py::TestWorkflows::test_project_partial_page_scrape[idb-https://www.iadb.org/en/project/CO-T1792] PASSED [ 69%]
extract/tests/integration/banks.py::TestWorkflows::test_project_partial_page_scrape[idb-https://www.iadb.org/en/project/RG-Q0153] PASSED [ 70%]
extract/tests/integration/banks.py::TestWorkflows::test_project_partial_page_scrape[undp-https://api.open.undp.org/api/projects/00061970.json] PASSED [ 71%]
extract/tests/integration/banks.py::TestWorkflows::test_project_partial_page_scrape[undp-https://api.open.undp.org/api/projects/00091070.json] PASSED [ 72%]
extract/tests/integration/banks.py::TestWorkflows::test_project_partial_page_scrape[undp-https://api.open.undp.org/api/projects/00107513.json] PASSED [ 74%]
extract/tests/integration/banks.py::TestWorkflows::test_result_page_multi_scrape[bio-https://www.bio-invest.be/en/investments/p1] PASSED [ 75%]
extract/tests/integration/banks.py::TestWorkflows::test_result_page_multi_scrape[bio-https://www.bio-invest.be/en/investments/p17] PASSED [ 76%]
extract/tests/integration/banks.py::TestWorkflows::test_result_page_multi_scrape[bio-https://www.bio-invest.be/en/investments/p31] PASSED [ 77%]
extract/tests/integration/banks.py::TestWorkflows::test_result_page_multi_scrape[eib-https://www.eib.org/page-provider/projects/list?pageNumber=0&itemPerPage=500&pageable=true&sortColumn=id] PASSED [ 79%]
extract/tests/integration/banks.py::TestWorkflows::test_result_page_multi_scrape[eib-https://www.eib.org/page-provider/projects/list?pageNumber=17&itemPerPage=500&pageable=true&sortColumn=id] PASSED [ 80%]
extract/tests/integration/banks.py::TestWorkflows::test_result_page_multi_scrape[eib-https://www.eib.org/page-provider/projects/list?pageNumber=32&itemPerPage=500&pageable=true&sortColumn=id] PASSED [ 81%]
extract/tests/integration/banks.py::TestWorkflows::test_result_page_multi_scrape[undp-http://open.undp.org/download/iati_xml/Belarus_Republic_of_projects.xml] PASSED [ 82%]
extract/tests/integration/banks.py::TestWorkflows::test_result_page_multi_scrape[undp-http://open.undp.org/download/iati_xml/Gambia_projects.xml] PASSED [ 83%]
extract/tests/integration/banks.py::TestWorkflows::test_result_page_multi_scrape[undp-https://open.undp.org/download/iati_xml/Niue_projects.xml] PASSED [ 85%]
extract/tests/integration/banks.py::TestWorkflows::test_result_page_scrape[fmo-https://www.fmo.nl/worldmap?page=1] PASSED [ 86%]
extract/tests/integration/banks.py::TestWorkflows::test_result_page_scrape[fmo-https://www.fmo.nl/worldmap?page=30] PASSED [ 87%]
extract/tests/integration/banks.py::TestWorkflows::test_result_page_scrape[fmo-https://www.fmo.nl/worldmap?page=55] PASSED [ 88%]
extract/tests/integration/banks.py::TestWorkflows::test_result_page_scrape[idb-https://www.iadb.org/en/project-search?page=0] PASSED [ 90%]
extract/tests/integration/banks.py::TestWorkflows::test_result_page_scrape[idb-https://www.iadb.org/en/project-search?page=400] PASSED [ 91%]
extract/tests/integration/banks.py::TestWorkflows::test_result_page_scrape[idb-https://www.iadb.org/en/project-search?page=997] PASSED [ 92%]
extract/tests/integration/banks.py::TestWorkflows::test_result_page_scrape[miga-https://www.miga.org/projects?page=0] PASSED [ 93%]
extract/tests/integration/banks.py::TestWorkflows::test_result_page_scrape[miga-https://www.miga.org/projects?page=68] PASSED [ 95%]
extract/tests/integration/banks.py::TestWorkflows::test_result_page_scrape[miga-https://www.miga.org/projects?page=118] PASSED [ 96%]
extract/tests/integration/banks.py::TestWorkflows::test_result_page_scrape[pro-https://www.proparco.fr/en/projects/list?page=0] PASSED [ 97%]
extract/tests/integration/banks.py::TestWorkflows::test_result_page_scrape[pro-https://www.proparco.fr/en/projects/list?page=25] PASSED [ 98%]
extract/tests/integration/banks.py::TestWorkflows::test_result_page_scrape[pro-https://www.proparco.fr/en/projects/list?page=39] PASSED [100%]

=================================== FAILURES ===================================
__________________ TestWorkflows.test_partial_download[ebrd] ___________________

self = <pipeline.extract.tests.integration.banks.TestWorkflows object at 0x7d1fbe2a12d0>
bank = 'ebrd'
data_request_client = <common.http.DataRequestClient object at 0x7d1fbde96450>

    def test_partial_download(
        self, bank: str, data_request_client: DataRequestClient
    ) -> None:
        """Test of the `ProjectPartialDownloadWorkflow`.
    
        Asserts that downloading a data file from
        a URL does not result in an exception.
    
        Args:
            bank: The abbreviation for the bank or
                financial institution (e.g., "AFDB").
    
            data_request_client: An instance of a client used to make HTTP
                requests while rotating headers.
    
        Returns:
            `None`
        """
        workflow: ProjectPartialDownloadWorkflow = WorkflowClassRegistry.get(
            source=bank,
            workflow_type=settings.PROJECT_PARTIAL_DOWNLOAD_WORKFLOW,
            data_request_client=data_request_client,
            msg_queue_client=None,
            db_client=None,
        )
    
        raw_projects = workflow.get_projects()
        raw_projects.to_csv(
            settings.EXTRACT_TEST_RESULT_DIR / f"{bank}_raw_projects.csv",
            index=False,
        )
    
>       urls, clean_projects = workflow.clean_projects(raw_projects)
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

extract/tests/integration/banks.py:189: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
extract/workflows/banks/ebrd.py:112: in clean_projects
    df[cols] = df[cols].map(
/app/.venv/lib/python3.11/site-packages/pandas/core/frame.py:10475: in map
    return self.apply(infer).__finalize__(self, "map")
           ^^^^^^^^^^^^^^^^^
/app/.venv/lib/python3.11/site-packages/pandas/core/frame.py:10381: in apply
    return op.apply().__finalize__(self, method="apply")
           ^^^^^^^^^^
/app/.venv/lib/python3.11/site-packages/pandas/core/apply.py:916: in apply
    return self.apply_standard()
           ^^^^^^^^^^^^^^^^^^^^^
/app/.venv/lib/python3.11/site-packages/pandas/core/apply.py:1063: in apply_standard
    results, res_index = self.apply_series_generator()
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
/app/.venv/lib/python3.11/site-packages/pandas/core/apply.py:1081: in apply_series_generator
    results[i] = self.func(v, *self.args, **self.kwargs)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
/app/.venv/lib/python3.11/site-packages/pandas/core/frame.py:10473: in infer
    return x._map_values(func, na_action=na_action)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
/app/.venv/lib/python3.11/site-packages/pandas/core/base.py:925: in _map_values
    return algorithms.map_array(arr, mapper, na_action=na_action, convert=convert)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
/app/.venv/lib/python3.11/site-packages/pandas/core/algorithms.py:1743: in map_array
    return lib.map_infer(values, mapper, convert=convert)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
pandas/_libs/lib.pyx:2999: in pandas._libs.lib.map_infer
    ???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

x = 'Türkiye: A Low Carbon and Climate Resilient Pathway Analysis for the Glass Sector'

>       lambda x: x.encode("raw_unicode_escape").decode("utf-8")
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    )
E   UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfc in position 1: invalid start byte

extract/workflows/banks/ebrd.py:113: UnicodeDecodeError
_ TestWorkflows.test_project_partial_page_scrape[ebrd-https://www.ebrd.com/home/work-with-us/projects/psd/36582.html] _

self = <pipeline.extract.tests.integration.banks.TestWorkflows object at 0x7d1fbe2c7590>
bank = 'ebrd'
project_partial_page_url = 'https://www.ebrd.com/home/work-with-us/projects/psd/36582.html'
data_request_client = <common.http.DataRequestClient object at 0x7d1fb868cad0>

    def test_project_partial_page_scrape(
        self,
        bank: str,
        project_partial_page_url: str,
        data_request_client: DataRequestClient,
    ) -> None:
        """Test of the `ProjectPartialScrapeWorkflow`.
    
        Asserts that scraping a project page for partial
        data does not result in an exception. Sleeps for three
        seconds in between HTTP calls to avoid potential
        throttling by the bank website.
    
        Args:
            bank: The abbreviation for the bank or
                financial institution (e.g., "AFDB").
    
            project_partial_page_url: A URL to a
                project page on the bank's website.
    
            data_request_client: An instance of a client used to make HTTP
                requests while rotating headers.
    
        Returns:
            `None`
        """
>       workflow: ProjectPartialScrapeWorkflow = WorkflowClassRegistry.get(
            source=bank,
            workflow_type=settings.PROJECT_PARTIAL_PAGE_WORKFLOW,
            data_request_client=data_request_client,
            msg_queue_client=None,
            db_client=None,
        )

extract/tests/integration/banks.py:296: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
extract/workflows/registry.py:206: in get
    return workflow_cls(**params)
           ^^^^^^^^^^^^^^^^^^^^^^
extract/workflows/banks/ebrd.py:167: in __init__
    self._gemini_client = genai.Client(api_key=api_key)
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
/app/.venv/lib/python3.11/site-packages/google/genai/client.py:219: in __init__
    self._api_client = self._get_api_client(
/app/.venv/lib/python3.11/site-packages/google/genai/client.py:265: in _get_api_client
    return BaseApiClient(
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <google.genai._api_client.BaseApiClient object at 0x7d1fba97f950>
vertexai = None, api_key = '', credentials = None, project = None
location = None, http_options = None

    def __init__(
        self,
        vertexai: Optional[bool] = None,
        api_key: Optional[str] = None,
        credentials: Optional[google.auth.credentials.Credentials] = None,
        project: Optional[str] = None,
        location: Optional[str] = None,
        http_options: Optional[HttpOptionsOrDict] = None,
    ):
      self.vertexai = vertexai
      if self.vertexai is None:
        if os.environ.get('GOOGLE_GENAI_USE_VERTEXAI', '0').lower() in [
            'true',
            '1',
        ]:
          self.vertexai = True
    
      # Validate explicitly set initializer values.
      if (project or location) and api_key:
        # API cannot consume both project/location and api_key.
        raise ValueError(
            'Project/location and API key are mutually exclusive in the client'
            ' initializer.'
        )
      elif credentials and api_key:
        # API cannot consume both credentials and api_key.
        raise ValueError(
            'Credentials and API key are mutually exclusive in the client'
            ' initializer.'
        )
    
      # Validate http_options if it is provided.
      validated_http_options = HttpOptions()
      if isinstance(http_options, dict):
        try:
          validated_http_options = HttpOptions.model_validate(http_options)
        except ValidationError as e:
          raise ValueError('Invalid http_options') from e
      elif isinstance(http_options, HttpOptions):
        validated_http_options = http_options
    
      # Retrieve implicitly set values from the environment.
      env_project = os.environ.get('GOOGLE_CLOUD_PROJECT', None)
      env_location = os.environ.get('GOOGLE_CLOUD_LOCATION', None)
      env_api_key = get_env_api_key()
      self.project = project or env_project
      self.location = location or env_location
      self.api_key = api_key or env_api_key
    
      self._credentials = credentials
      self._http_options = HttpOptions()
      # Initialize the lock. This lock will be used to protect access to the
      # credentials. This is crucial for thread safety when multiple coroutines
      # might be accessing the credentials at the same time.
      try:
        self._sync_auth_lock = threading.Lock()
        self._async_auth_lock = asyncio.Lock()
      except RuntimeError:
        asyncio.set_event_loop(asyncio.new_event_loop())
        self._sync_auth_lock = threading.Lock()
        self._async_auth_lock = asyncio.Lock()
    
      # Handle when to use Vertex AI in express mode (api key).
      # Explicit initializer arguments are already validated above.
      if self.vertexai:
        if credentials:
          # Explicit credentials take precedence over implicit api_key.
          logger.info(
              'The user provided Google Cloud credentials will take precedence'
              + ' over the API key from the environment variable.'
          )
          self.api_key = None
        elif (env_location or env_project) and api_key:
          # Explicit api_key takes precedence over implicit project/location.
          logger.info(
              'The user provided Vertex AI API key will take precedence over the'
              + ' project/location from the environment variables.'
          )
          self.project = None
          self.location = None
        elif (project or location) and env_api_key:
          # Explicit project/location takes precedence over implicit api_key.
          logger.info(
              'The user provided project/location will take precedence over the'
              + ' Vertex AI API key from the environment variable.'
          )
          self.api_key = None
        elif (env_location or env_project) and env_api_key:
          # Implicit project/location takes precedence over implicit api_key.
          logger.info(
              'The project/location from the environment variables will take'
              + ' precedence over the API key from the environment variables.'
          )
          self.api_key = None
    
        # Skip fetching project from ADC if base url is provided in http options.
        if (
            not self.project
            and not self.api_key
            and not validated_http_options.base_url
        ):
          credentials, self.project = load_auth(project=None)
          if not self._credentials:
            self._credentials = credentials
    
        has_sufficient_auth = (self.project and self.location) or self.api_key
    
        if not has_sufficient_auth and not validated_http_options.base_url:
          # Skip sufficient auth check if base url is provided in http options.
          raise ValueError(
              'Project and location or API key must be set when using the Vertex '
              'AI API.'
          )
        if self.api_key or self.location == 'global':
          self._http_options.base_url = f'https://aiplatform.googleapis.com/'
        elif validated_http_options.base_url and not has_sufficient_auth:
          # Avoid setting default base url and api version if base_url provided.
          self._http_options.base_url = validated_http_options.base_url
        else:
          self._http_options.base_url = (
              f'https://{self.location}-aiplatform.googleapis.com/'
          )
        self._http_options.api_version = 'v1beta1'
      else:  # Implicit initialization or missing arguments.
        if not self.api_key:
>         raise ValueError(
              'Missing key inputs argument! To use the Google AI API,'
              ' provide (`api_key`) arguments. To use the Google Cloud API,'
              ' provide (`vertexai`, `project` & `location`) arguments.'
          )
E         ValueError: Missing key inputs argument! To use the Google AI API, provide (`api_key`) arguments. To use the Google Cloud API, provide (`vertexai`, `project` & `location`) arguments.

/app/.venv/lib/python3.11/site-packages/google/genai/_api_client.py:658: ValueError
_ TestWorkflows.test_project_partial_page_scrape[ebrd-https://www.ebrd.com/home/work-with-us/projects/psd/56092.html] _

self = <pipeline.extract.tests.integration.banks.TestWorkflows object at 0x7d1fbe2c7210>
bank = 'ebrd'
project_partial_page_url = 'https://www.ebrd.com/home/work-with-us/projects/psd/56092.html'
data_request_client = <common.http.DataRequestClient object at 0x7d1fbdccb110>

    def test_project_partial_page_scrape(
        self,
        bank: str,
        project_partial_page_url: str,
        data_request_client: DataRequestClient,
    ) -> None:
        """Test of the `ProjectPartialScrapeWorkflow`.
    
        Asserts that scraping a project page for partial
        data does not result in an exception. Sleeps for three
        seconds in between HTTP calls to avoid potential
        throttling by the bank website.
    
        Args:
            bank: The abbreviation for the bank or
                financial institution (e.g., "AFDB").
    
            project_partial_page_url: A URL to a
                project page on the bank's website.
    
            data_request_client: An instance of a client used to make HTTP
                requests while rotating headers.
    
        Returns:
            `None`
        """
>       workflow: ProjectPartialScrapeWorkflow = WorkflowClassRegistry.get(
            source=bank,
            workflow_type=settings.PROJECT_PARTIAL_PAGE_WORKFLOW,
            data_request_client=data_request_client,
            msg_queue_client=None,
            db_client=None,
        )

extract/tests/integration/banks.py:296: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
extract/workflows/registry.py:206: in get
    return workflow_cls(**params)
           ^^^^^^^^^^^^^^^^^^^^^^
extract/workflows/banks/ebrd.py:167: in __init__
    self._gemini_client = genai.Client(api_key=api_key)
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
/app/.venv/lib/python3.11/site-packages/google/genai/client.py:219: in __init__
    self._api_client = self._get_api_client(
/app/.venv/lib/python3.11/site-packages/google/genai/client.py:265: in _get_api_client
    return BaseApiClient(
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <google.genai._api_client.BaseApiClient object at 0x7d1fbdcc98d0>
vertexai = None, api_key = '', credentials = None, project = None
location = None, http_options = None

    def __init__(
        self,
        vertexai: Optional[bool] = None,
        api_key: Optional[str] = None,
        credentials: Optional[google.auth.credentials.Credentials] = None,
        project: Optional[str] = None,
        location: Optional[str] = None,
        http_options: Optional[HttpOptionsOrDict] = None,
    ):
      self.vertexai = vertexai
      if self.vertexai is None:
        if os.environ.get('GOOGLE_GENAI_USE_VERTEXAI', '0').lower() in [
            'true',
            '1',
        ]:
          self.vertexai = True
    
      # Validate explicitly set initializer values.
      if (project or location) and api_key:
        # API cannot consume both project/location and api_key.
        raise ValueError(
            'Project/location and API key are mutually exclusive in the client'
            ' initializer.'
        )
      elif credentials and api_key:
        # API cannot consume both credentials and api_key.
        raise ValueError(
            'Credentials and API key are mutually exclusive in the client'
            ' initializer.'
        )
    
      # Validate http_options if it is provided.
      validated_http_options = HttpOptions()
      if isinstance(http_options, dict):
        try:
          validated_http_options = HttpOptions.model_validate(http_options)
        except ValidationError as e:
          raise ValueError('Invalid http_options') from e
      elif isinstance(http_options, HttpOptions):
        validated_http_options = http_options
    
      # Retrieve implicitly set values from the environment.
      env_project = os.environ.get('GOOGLE_CLOUD_PROJECT', None)
      env_location = os.environ.get('GOOGLE_CLOUD_LOCATION', None)
      env_api_key = get_env_api_key()
      self.project = project or env_project
      self.location = location or env_location
      self.api_key = api_key or env_api_key
    
      self._credentials = credentials
      self._http_options = HttpOptions()
      # Initialize the lock. This lock will be used to protect access to the
      # credentials. This is crucial for thread safety when multiple coroutines
      # might be accessing the credentials at the same time.
      try:
        self._sync_auth_lock = threading.Lock()
        self._async_auth_lock = asyncio.Lock()
      except RuntimeError:
        asyncio.set_event_loop(asyncio.new_event_loop())
        self._sync_auth_lock = threading.Lock()
        self._async_auth_lock = asyncio.Lock()
    
      # Handle when to use Vertex AI in express mode (api key).
      # Explicit initializer arguments are already validated above.
      if self.vertexai:
        if credentials:
          # Explicit credentials take precedence over implicit api_key.
          logger.info(
              'The user provided Google Cloud credentials will take precedence'
              + ' over the API key from the environment variable.'
          )
          self.api_key = None
        elif (env_location or env_project) and api_key:
          # Explicit api_key takes precedence over implicit project/location.
          logger.info(
              'The user provided Vertex AI API key will take precedence over the'
              + ' project/location from the environment variables.'
          )
          self.project = None
          self.location = None
        elif (project or location) and env_api_key:
          # Explicit project/location takes precedence over implicit api_key.
          logger.info(
              'The user provided project/location will take precedence over the'
              + ' Vertex AI API key from the environment variable.'
          )
          self.api_key = None
        elif (env_location or env_project) and env_api_key:
          # Implicit project/location takes precedence over implicit api_key.
          logger.info(
              'The project/location from the environment variables will take'
              + ' precedence over the API key from the environment variables.'
          )
          self.api_key = None
    
        # Skip fetching project from ADC if base url is provided in http options.
        if (
            not self.project
            and not self.api_key
            and not validated_http_options.base_url
        ):
          credentials, self.project = load_auth(project=None)
          if not self._credentials:
            self._credentials = credentials
    
        has_sufficient_auth = (self.project and self.location) or self.api_key
    
        if not has_sufficient_auth and not validated_http_options.base_url:
          # Skip sufficient auth check if base url is provided in http options.
          raise ValueError(
              'Project and location or API key must be set when using the Vertex '
              'AI API.'
          )
        if self.api_key or self.location == 'global':
          self._http_options.base_url = f'https://aiplatform.googleapis.com/'
        elif validated_http_options.base_url and not has_sufficient_auth:
          # Avoid setting default base url and api version if base_url provided.
          self._http_options.base_url = validated_http_options.base_url
        else:
          self._http_options.base_url = (
              f'https://{self.location}-aiplatform.googleapis.com/'
          )
        self._http_options.api_version = 'v1beta1'
      else:  # Implicit initialization or missing arguments.
        if not self.api_key:
>         raise ValueError(
              'Missing key inputs argument! To use the Google AI API,'
              ' provide (`api_key`) arguments. To use the Google Cloud API,'
              ' provide (`vertexai`, `project` & `location`) arguments.'
          )
E         ValueError: Missing key inputs argument! To use the Google AI API, provide (`api_key`) arguments. To use the Google Cloud API, provide (`vertexai`, `project` & `location`) arguments.

/app/.venv/lib/python3.11/site-packages/google/genai/_api_client.py:658: ValueError
_ TestWorkflows.test_project_partial_page_scrape[ebrd-https://www.ebrd.com/work-with-us/projects/psd/52642.html] _

self = <pipeline.extract.tests.integration.banks.TestWorkflows object at 0x7d1fbe2ed350>
bank = 'ebrd'
project_partial_page_url = 'https://www.ebrd.com/work-with-us/projects/psd/52642.html'
data_request_client = <common.http.DataRequestClient object at 0x7d1fbe1070d0>

    def test_project_partial_page_scrape(
        self,
        bank: str,
        project_partial_page_url: str,
        data_request_client: DataRequestClient,
    ) -> None:
        """Test of the `ProjectPartialScrapeWorkflow`.
    
        Asserts that scraping a project page for partial
        data does not result in an exception. Sleeps for three
        seconds in between HTTP calls to avoid potential
        throttling by the bank website.
    
        Args:
            bank: The abbreviation for the bank or
                financial institution (e.g., "AFDB").
    
            project_partial_page_url: A URL to a
                project page on the bank's website.
    
            data_request_client: An instance of a client used to make HTTP
                requests while rotating headers.
    
        Returns:
            `None`
        """
>       workflow: ProjectPartialScrapeWorkflow = WorkflowClassRegistry.get(
            source=bank,
            workflow_type=settings.PROJECT_PARTIAL_PAGE_WORKFLOW,
            data_request_client=data_request_client,
            msg_queue_client=None,
            db_client=None,
        )

extract/tests/integration/banks.py:296: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
extract/workflows/registry.py:206: in get
    return workflow_cls(**params)
           ^^^^^^^^^^^^^^^^^^^^^^
extract/workflows/banks/ebrd.py:167: in __init__
    self._gemini_client = genai.Client(api_key=api_key)
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
/app/.venv/lib/python3.11/site-packages/google/genai/client.py:219: in __init__
    self._api_client = self._get_api_client(
/app/.venv/lib/python3.11/site-packages/google/genai/client.py:265: in _get_api_client
    return BaseApiClient(
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <google.genai._api_client.BaseApiClient object at 0x7d1fbe104150>
vertexai = None, api_key = '', credentials = None, project = None
location = None, http_options = None

    def __init__(
        self,
        vertexai: Optional[bool] = None,
        api_key: Optional[str] = None,
        credentials: Optional[google.auth.credentials.Credentials] = None,
        project: Optional[str] = None,
        location: Optional[str] = None,
        http_options: Optional[HttpOptionsOrDict] = None,
    ):
      self.vertexai = vertexai
      if self.vertexai is None:
        if os.environ.get('GOOGLE_GENAI_USE_VERTEXAI', '0').lower() in [
            'true',
            '1',
        ]:
          self.vertexai = True
    
      # Validate explicitly set initializer values.
      if (project or location) and api_key:
        # API cannot consume both project/location and api_key.
        raise ValueError(
            'Project/location and API key are mutually exclusive in the client'
            ' initializer.'
        )
      elif credentials and api_key:
        # API cannot consume both credentials and api_key.
        raise ValueError(
            'Credentials and API key are mutually exclusive in the client'
            ' initializer.'
        )
    
      # Validate http_options if it is provided.
      validated_http_options = HttpOptions()
      if isinstance(http_options, dict):
        try:
          validated_http_options = HttpOptions.model_validate(http_options)
        except ValidationError as e:
          raise ValueError('Invalid http_options') from e
      elif isinstance(http_options, HttpOptions):
        validated_http_options = http_options
    
      # Retrieve implicitly set values from the environment.
      env_project = os.environ.get('GOOGLE_CLOUD_PROJECT', None)
      env_location = os.environ.get('GOOGLE_CLOUD_LOCATION', None)
      env_api_key = get_env_api_key()
      self.project = project or env_project
      self.location = location or env_location
      self.api_key = api_key or env_api_key
    
      self._credentials = credentials
      self._http_options = HttpOptions()
      # Initialize the lock. This lock will be used to protect access to the
      # credentials. This is crucial for thread safety when multiple coroutines
      # might be accessing the credentials at the same time.
      try:
        self._sync_auth_lock = threading.Lock()
        self._async_auth_lock = asyncio.Lock()
      except RuntimeError:
        asyncio.set_event_loop(asyncio.new_event_loop())
        self._sync_auth_lock = threading.Lock()
        self._async_auth_lock = asyncio.Lock()
    
      # Handle when to use Vertex AI in express mode (api key).
      # Explicit initializer arguments are already validated above.
      if self.vertexai:
        if credentials:
          # Explicit credentials take precedence over implicit api_key.
          logger.info(
              'The user provided Google Cloud credentials will take precedence'
              + ' over the API key from the environment variable.'
          )
          self.api_key = None
        elif (env_location or env_project) and api_key:
          # Explicit api_key takes precedence over implicit project/location.
          logger.info(
              'The user provided Vertex AI API key will take precedence over the'
              + ' project/location from the environment variables.'
          )
          self.project = None
          self.location = None
        elif (project or location) and env_api_key:
          # Explicit project/location takes precedence over implicit api_key.
          logger.info(
              'The user provided project/location will take precedence over the'
              + ' Vertex AI API key from the environment variable.'
          )
          self.api_key = None
        elif (env_location or env_project) and env_api_key:
          # Implicit project/location takes precedence over implicit api_key.
          logger.info(
              'The project/location from the environment variables will take'
              + ' precedence over the API key from the environment variables.'
          )
          self.api_key = None
    
        # Skip fetching project from ADC if base url is provided in http options.
        if (
            not self.project
            and not self.api_key
            and not validated_http_options.base_url
        ):
          credentials, self.project = load_auth(project=None)
          if not self._credentials:
            self._credentials = credentials
    
        has_sufficient_auth = (self.project and self.location) or self.api_key
    
        if not has_sufficient_auth and not validated_http_options.base_url:
          # Skip sufficient auth check if base url is provided in http options.
          raise ValueError(
              'Project and location or API key must be set when using the Vertex '
              'AI API.'
          )
        if self.api_key or self.location == 'global':
          self._http_options.base_url = f'https://aiplatform.googleapis.com/'
        elif validated_http_options.base_url and not has_sufficient_auth:
          # Avoid setting default base url and api version if base_url provided.
          self._http_options.base_url = validated_http_options.base_url
        else:
          self._http_options.base_url = (
              f'https://{self.location}-aiplatform.googleapis.com/'
          )
        self._http_options.api_version = 'v1beta1'
      else:  # Implicit initialization or missing arguments.
        if not self.api_key:
>         raise ValueError(
              'Missing key inputs argument! To use the Google AI API,'
              ' provide (`api_key`) arguments. To use the Google Cloud API,'
              ' provide (`vertexai`, `project` & `location`) arguments.'
          )
E         ValueError: Missing key inputs argument! To use the Google AI API, provide (`api_key`) arguments. To use the Google Cloud API, provide (`vertexai`, `project` & `location`) arguments.

/app/.venv/lib/python3.11/site-packages/google/genai/_api_client.py:658: ValueError
_ TestWorkflows.test_project_partial_page_scrape[ebrd-https://www.ebrd.com/work-with-us/projects/psd/54846.html] _

self = <pipeline.extract.tests.integration.banks.TestWorkflows object at 0x7d1fbe2ed150>
bank = 'ebrd'
project_partial_page_url = 'https://www.ebrd.com/work-with-us/projects/psd/54846.html'
data_request_client = <common.http.DataRequestClient object at 0x7d1fb93a5bd0>

    def test_project_partial_page_scrape(
        self,
        bank: str,
        project_partial_page_url: str,
        data_request_client: DataRequestClient,
    ) -> None:
        """Test of the `ProjectPartialScrapeWorkflow`.
    
        Asserts that scraping a project page for partial
        data does not result in an exception. Sleeps for three
        seconds in between HTTP calls to avoid potential
        throttling by the bank website.
    
        Args:
            bank: The abbreviation for the bank or
                financial institution (e.g., "AFDB").
    
            project_partial_page_url: A URL to a
                project page on the bank's website.
    
            data_request_client: An instance of a client used to make HTTP
                requests while rotating headers.
    
        Returns:
            `None`
        """
>       workflow: ProjectPartialScrapeWorkflow = WorkflowClassRegistry.get(
            source=bank,
            workflow_type=settings.PROJECT_PARTIAL_PAGE_WORKFLOW,
            data_request_client=data_request_client,
            msg_queue_client=None,
            db_client=None,
        )

extract/tests/integration/banks.py:296: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
extract/workflows/registry.py:206: in get
    return workflow_cls(**params)
           ^^^^^^^^^^^^^^^^^^^^^^
extract/workflows/banks/ebrd.py:167: in __init__
    self._gemini_client = genai.Client(api_key=api_key)
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
/app/.venv/lib/python3.11/site-packages/google/genai/client.py:219: in __init__
    self._api_client = self._get_api_client(
/app/.venv/lib/python3.11/site-packages/google/genai/client.py:265: in _get_api_client
    return BaseApiClient(
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <google.genai._api_client.BaseApiClient object at 0x7d1fb93a5b90>
vertexai = None, api_key = '', credentials = None, project = None
location = None, http_options = None

    def __init__(
        self,
        vertexai: Optional[bool] = None,
        api_key: Optional[str] = None,
        credentials: Optional[google.auth.credentials.Credentials] = None,
        project: Optional[str] = None,
        location: Optional[str] = None,
        http_options: Optional[HttpOptionsOrDict] = None,
    ):
      self.vertexai = vertexai
      if self.vertexai is None:
        if os.environ.get('GOOGLE_GENAI_USE_VERTEXAI', '0').lower() in [
            'true',
            '1',
        ]:
          self.vertexai = True
    
      # Validate explicitly set initializer values.
      if (project or location) and api_key:
        # API cannot consume both project/location and api_key.
        raise ValueError(
            'Project/location and API key are mutually exclusive in the client'
            ' initializer.'
        )
      elif credentials and api_key:
        # API cannot consume both credentials and api_key.
        raise ValueError(
            'Credentials and API key are mutually exclusive in the client'
            ' initializer.'
        )
    
      # Validate http_options if it is provided.
      validated_http_options = HttpOptions()
      if isinstance(http_options, dict):
        try:
          validated_http_options = HttpOptions.model_validate(http_options)
        except ValidationError as e:
          raise ValueError('Invalid http_options') from e
      elif isinstance(http_options, HttpOptions):
        validated_http_options = http_options
    
      # Retrieve implicitly set values from the environment.
      env_project = os.environ.get('GOOGLE_CLOUD_PROJECT', None)
      env_location = os.environ.get('GOOGLE_CLOUD_LOCATION', None)
      env_api_key = get_env_api_key()
      self.project = project or env_project
      self.location = location or env_location
      self.api_key = api_key or env_api_key
    
      self._credentials = credentials
      self._http_options = HttpOptions()
      # Initialize the lock. This lock will be used to protect access to the
      # credentials. This is crucial for thread safety when multiple coroutines
      # might be accessing the credentials at the same time.
      try:
        self._sync_auth_lock = threading.Lock()
        self._async_auth_lock = asyncio.Lock()
      except RuntimeError:
        asyncio.set_event_loop(asyncio.new_event_loop())
        self._sync_auth_lock = threading.Lock()
        self._async_auth_lock = asyncio.Lock()
    
      # Handle when to use Vertex AI in express mode (api key).
      # Explicit initializer arguments are already validated above.
      if self.vertexai:
        if credentials:
          # Explicit credentials take precedence over implicit api_key.
          logger.info(
              'The user provided Google Cloud credentials will take precedence'
              + ' over the API key from the environment variable.'
          )
          self.api_key = None
        elif (env_location or env_project) and api_key:
          # Explicit api_key takes precedence over implicit project/location.
          logger.info(
              'The user provided Vertex AI API key will take precedence over the'
              + ' project/location from the environment variables.'
          )
          self.project = None
          self.location = None
        elif (project or location) and env_api_key:
          # Explicit project/location takes precedence over implicit api_key.
          logger.info(
              'The user provided project/location will take precedence over the'
              + ' Vertex AI API key from the environment variable.'
          )
          self.api_key = None
        elif (env_location or env_project) and env_api_key:
          # Implicit project/location takes precedence over implicit api_key.
          logger.info(
              'The project/location from the environment variables will take'
              + ' precedence over the API key from the environment variables.'
          )
          self.api_key = None
    
        # Skip fetching project from ADC if base url is provided in http options.
        if (
            not self.project
            and not self.api_key
            and not validated_http_options.base_url
        ):
          credentials, self.project = load_auth(project=None)
          if not self._credentials:
            self._credentials = credentials
    
        has_sufficient_auth = (self.project and self.location) or self.api_key
    
        if not has_sufficient_auth and not validated_http_options.base_url:
          # Skip sufficient auth check if base url is provided in http options.
          raise ValueError(
              'Project and location or API key must be set when using the Vertex '
              'AI API.'
          )
        if self.api_key or self.location == 'global':
          self._http_options.base_url = f'https://aiplatform.googleapis.com/'
        elif validated_http_options.base_url and not has_sufficient_auth:
          # Avoid setting default base url and api version if base_url provided.
          self._http_options.base_url = validated_http_options.base_url
        else:
          self._http_options.base_url = (
              f'https://{self.location}-aiplatform.googleapis.com/'
          )
        self._http_options.api_version = 'v1beta1'
      else:  # Implicit initialization or missing arguments.
        if not self.api_key:
>         raise ValueError(
              'Missing key inputs argument! To use the Google AI API,'
              ' provide (`api_key`) arguments. To use the Google Cloud API,'
              ' provide (`vertexai`, `project` & `location`) arguments.'
          )
E         ValueError: Missing key inputs argument! To use the Google AI API, provide (`api_key`) arguments. To use the Google Cloud API, provide (`vertexai`, `project` & `location`) arguments.

/app/.venv/lib/python3.11/site-packages/google/genai/_api_client.py:658: ValueError
_ TestWorkflows.test_project_partial_page_scrape[ebrd-https://www.ebrd.com/work-with-us/projects/psd/technonicol-regional-expansion--resource-efficiency.html] _

self = <pipeline.extract.tests.integration.banks.TestWorkflows object at 0x7d1fbe2ece90>
bank = 'ebrd'
project_partial_page_url = 'https://www.ebrd.com/work-with-us/projects/psd/technonicol-regional-expansion--resource-efficiency.html'
data_request_client = <common.http.DataRequestClient object at 0x7d1fbdc40350>

    def test_project_partial_page_scrape(
        self,
        bank: str,
        project_partial_page_url: str,
        data_request_client: DataRequestClient,
    ) -> None:
        """Test of the `ProjectPartialScrapeWorkflow`.
    
        Asserts that scraping a project page for partial
        data does not result in an exception. Sleeps for three
        seconds in between HTTP calls to avoid potential
        throttling by the bank website.
    
        Args:
            bank: The abbreviation for the bank or
                financial institution (e.g., "AFDB").
    
            project_partial_page_url: A URL to a
                project page on the bank's website.
    
            data_request_client: An instance of a client used to make HTTP
                requests while rotating headers.
    
        Returns:
            `None`
        """
>       workflow: ProjectPartialScrapeWorkflow = WorkflowClassRegistry.get(
            source=bank,
            workflow_type=settings.PROJECT_PARTIAL_PAGE_WORKFLOW,
            data_request_client=data_request_client,
            msg_queue_client=None,
            db_client=None,
        )

extract/tests/integration/banks.py:296: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
extract/workflows/registry.py:206: in get
    return workflow_cls(**params)
           ^^^^^^^^^^^^^^^^^^^^^^
extract/workflows/banks/ebrd.py:167: in __init__
    self._gemini_client = genai.Client(api_key=api_key)
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
/app/.venv/lib/python3.11/site-packages/google/genai/client.py:219: in __init__
    self._api_client = self._get_api_client(
/app/.venv/lib/python3.11/site-packages/google/genai/client.py:265: in _get_api_client
    return BaseApiClient(
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <google.genai._api_client.BaseApiClient object at 0x7d1fbdc42b50>
vertexai = None, api_key = '', credentials = None, project = None
location = None, http_options = None

    def __init__(
        self,
        vertexai: Optional[bool] = None,
        api_key: Optional[str] = None,
        credentials: Optional[google.auth.credentials.Credentials] = None,
        project: Optional[str] = None,
        location: Optional[str] = None,
        http_options: Optional[HttpOptionsOrDict] = None,
    ):
      self.vertexai = vertexai
      if self.vertexai is None:
        if os.environ.get('GOOGLE_GENAI_USE_VERTEXAI', '0').lower() in [
            'true',
            '1',
        ]:
          self.vertexai = True
    
      # Validate explicitly set initializer values.
      if (project or location) and api_key:
        # API cannot consume both project/location and api_key.
        raise ValueError(
            'Project/location and API key are mutually exclusive in the client'
            ' initializer.'
        )
      elif credentials and api_key:
        # API cannot consume both credentials and api_key.
        raise ValueError(
            'Credentials and API key are mutually exclusive in the client'
            ' initializer.'
        )
    
      # Validate http_options if it is provided.
      validated_http_options = HttpOptions()
      if isinstance(http_options, dict):
        try:
          validated_http_options = HttpOptions.model_validate(http_options)
        except ValidationError as e:
          raise ValueError('Invalid http_options') from e
      elif isinstance(http_options, HttpOptions):
        validated_http_options = http_options
    
      # Retrieve implicitly set values from the environment.
      env_project = os.environ.get('GOOGLE_CLOUD_PROJECT', None)
      env_location = os.environ.get('GOOGLE_CLOUD_LOCATION', None)
      env_api_key = get_env_api_key()
      self.project = project or env_project
      self.location = location or env_location
      self.api_key = api_key or env_api_key
    
      self._credentials = credentials
      self._http_options = HttpOptions()
      # Initialize the lock. This lock will be used to protect access to the
      # credentials. This is crucial for thread safety when multiple coroutines
      # might be accessing the credentials at the same time.
      try:
        self._sync_auth_lock = threading.Lock()
        self._async_auth_lock = asyncio.Lock()
      except RuntimeError:
        asyncio.set_event_loop(asyncio.new_event_loop())
        self._sync_auth_lock = threading.Lock()
        self._async_auth_lock = asyncio.Lock()
    
      # Handle when to use Vertex AI in express mode (api key).
      # Explicit initializer arguments are already validated above.
      if self.vertexai:
        if credentials:
          # Explicit credentials take precedence over implicit api_key.
          logger.info(
              'The user provided Google Cloud credentials will take precedence'
              + ' over the API key from the environment variable.'
          )
          self.api_key = None
        elif (env_location or env_project) and api_key:
          # Explicit api_key takes precedence over implicit project/location.
          logger.info(
              'The user provided Vertex AI API key will take precedence over the'
              + ' project/location from the environment variables.'
          )
          self.project = None
          self.location = None
        elif (project or location) and env_api_key:
          # Explicit project/location takes precedence over implicit api_key.
          logger.info(
              'The user provided project/location will take precedence over the'
              + ' Vertex AI API key from the environment variable.'
          )
          self.api_key = None
        elif (env_location or env_project) and env_api_key:
          # Implicit project/location takes precedence over implicit api_key.
          logger.info(
              'The project/location from the environment variables will take'
              + ' precedence over the API key from the environment variables.'
          )
          self.api_key = None
    
        # Skip fetching project from ADC if base url is provided in http options.
        if (
            not self.project
            and not self.api_key
            and not validated_http_options.base_url
        ):
          credentials, self.project = load_auth(project=None)
          if not self._credentials:
            self._credentials = credentials
    
        has_sufficient_auth = (self.project and self.location) or self.api_key
    
        if not has_sufficient_auth and not validated_http_options.base_url:
          # Skip sufficient auth check if base url is provided in http options.
          raise ValueError(
              'Project and location or API key must be set when using the Vertex '
              'AI API.'
          )
        if self.api_key or self.location == 'global':
          self._http_options.base_url = f'https://aiplatform.googleapis.com/'
        elif validated_http_options.base_url and not has_sufficient_auth:
          # Avoid setting default base url and api version if base_url provided.
          self._http_options.base_url = validated_http_options.base_url
        else:
          self._http_options.base_url = (
              f'https://{self.location}-aiplatform.googleapis.com/'
          )
        self._http_options.api_version = 'v1beta1'
      else:  # Implicit initialization or missing arguments.
        if not self.api_key:
>         raise ValueError(
              'Missing key inputs argument! To use the Google AI API,'
              ' provide (`api_key`) arguments. To use the Google Cloud API,'
              ' provide (`vertexai`, `project` & `location`) arguments.'
          )
E         ValueError: Missing key inputs argument! To use the Google AI API, provide (`api_key`) arguments. To use the Google Cloud API, provide (`vertexai`, `project` & `location`) arguments.

/app/.venv/lib/python3.11/site-packages/google/genai/_api_client.py:658: ValueError
=============================== warnings summary ===============================
../../.venv/lib/python3.11/site-packages/pyppeteer/us_keyboard_layout.py:73
  /app/.venv/lib/python3.11/site-packages/pyppeteer/us_keyboard_layout.py:73: DeprecationWarning: invalid escape sequence '\('
    'Digit9': {'keyCode': 57, 'code': 'Digit9', 'shiftKey': '\(', 'key': '9'},

../../.venv/lib/python3.11/site-packages/pyppeteer/us_keyboard_layout.py:143
  /app/.venv/lib/python3.11/site-packages/pyppeteer/us_keyboard_layout.py:143: DeprecationWarning: invalid escape sequence '\<'
    'Comma': {'keyCode': 188, 'code': 'Comma', 'shiftKey': '\<', 'key': ','},

../../.venv/lib/python3.11/site-packages/pyppeteer/us_keyboard_layout.py:247
  /app/.venv/lib/python3.11/site-packages/pyppeteer/us_keyboard_layout.py:247: DeprecationWarning: invalid escape sequence '\<'
    '<': {'keyCode': 188, 'key': '\<', 'code': 'Comma'},

src/pipeline/extract/tests/integration/banks.py::TestWorkflows::test_download[deg]
  /app/src/pipeline/extract/workflows/banks/deg.py:237: FutureWarning: Downcasting behavior in `replace` is deprecated and will be removed in a future version. To retain the old behavior, explicitly call `result.infer_objects(copy=False)`. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`
    df[cols] = df[cols].replace({None: ""})

src/pipeline/extract/tests/integration/banks.py::TestWorkflows::test_download[kfw]
  /app/src/pipeline/extract/workflows/banks/kfw.py:182: FutureWarning: Downcasting behavior in `replace` is deprecated and will be removed in a future version. To retain the old behavior, explicitly call `result.infer_objects(copy=False)`. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`
    df[cols] = df[cols].replace({None: ""})

src/pipeline/extract/tests/integration/banks.py::TestWorkflows::test_download[wb]
  /app/src/pipeline/extract/workflows/banks/wb.py:141: SettingWithCopyWarning: 
  A value is trying to be set on a copy of a slice from a DataFrame.
  Try using .loc[row_indexer,col_indexer] = value instead
  
  See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
    financers_df["Financers"] = financers_df["Name"].str.upper()

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
=========================== short test summary info ============================
FAILED extract/tests/integration/banks.py::TestWorkflows::test_partial_download[ebrd] - UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfc in position 1: invalid start byte
FAILED extract/tests/integration/banks.py::TestWorkflows::test_project_partial_page_scrape[ebrd-https://www.ebrd.com/home/work-with-us/projects/psd/36582.html] - ValueError: Missing key inputs argument! To use the Google AI API, provide (`api_key`) arguments. To use the Google Cloud API, provide (`vertexai`, `project` & `location`) arguments.
FAILED extract/tests/integration/banks.py::TestWorkflows::test_project_partial_page_scrape[ebrd-https://www.ebrd.com/home/work-with-us/projects/psd/56092.html] - ValueError: Missing key inputs argument! To use the Google AI API, provide (`api_key`) arguments. To use the Google Cloud API, provide (`vertexai`, `project` & `location`) arguments.
FAILED extract/tests/integration/banks.py::TestWorkflows::test_project_partial_page_scrape[ebrd-https://www.ebrd.com/work-with-us/projects/psd/52642.html] - ValueError: Missing key inputs argument! To use the Google AI API, provide (`api_key`) arguments. To use the Google Cloud API, provide (`vertexai`, `project` & `location`) arguments.
FAILED extract/tests/integration/banks.py::TestWorkflows::test_project_partial_page_scrape[ebrd-https://www.ebrd.com/work-with-us/projects/psd/54846.html] - ValueError: Missing key inputs argument! To use the Google AI API, provide (`api_key`) arguments. To use the Google Cloud API, provide (`vertexai`, `project` & `location`) arguments.
FAILED extract/tests/integration/banks.py::TestWorkflows::test_project_partial_page_scrape[ebrd-https://www.ebrd.com/work-with-us/projects/psd/technonicol-regional-expansion--resource-efficiency.html] - ValueError: Missing key inputs argument! To use the Google AI API, provide (`api_key`) arguments. To use the Google Cloud API, provide (`vertexai`, `project` & `location`) arguments.
============= 6 failed, 75 passed, 6 warnings in 410.49s (0:06:50) =============
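
For context on the first failure in the summary above: a `UnicodeDecodeError` on byte `0xfc` is what Python raises when bytes in a single-byte encoding are decoded as UTF-8. The following is only a minimal sketch of one common workaround, not the change made in this PR. The helper name `decode_with_fallback` and the choice of `latin-1` as the fallback are assumptions for illustration; the actual encoding of the EBRD download is not shown in these logs.

```python
def decode_with_fallback(raw: bytes, fallback: str = "latin-1") -> str:
    """Decode bytes as UTF-8, falling back to another encoding on failure.

    Hypothetical helper: the fallback encoding is an assumption and
    should be replaced by whatever the data source actually uses.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # 0xfc is never a valid UTF-8 start byte, so input such as
        # b"f\xfcr" ends up here; latin-1 maps every byte to a character.
        return raw.decode(fallback)


if __name__ == "__main__":
    print(decode_with_fallback("für".encode("utf-8")))  # valid UTF-8 input
    print(decode_with_fallback(b"f\xfcr"))              # non-UTF-8 input
```

Because `latin-1` never fails to decode, this trades a hard error for possibly wrong characters when the guess is incorrect, so an explicit, known encoding is preferable where one is available.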

@jpivarski

Just test the failing ones

docker run --rm -v "$PWD/services/extract/src:/app/src" --env-file "$PWD/services/extract/.env.dev" debit-scrapers bash -lc "cd /app/src/pipeline && uv run pytest './extract/tests/integration/banks.py::TestWorkflows::test_partial_download[ebrd]' './extract/tests/integration/banks.py::TestWorkflows::test_project_partial_page_scrape[ebrd-https://www.ebrd.com/home/work-with-us/projects/psd/36582.html]' './extract/tests/integration/banks.py::TestWorkflows::test_project_partial_page_scrape[ebrd-https://www.ebrd.com/home/work-with-us/projects/psd/56092.html]' './extract/tests/integration/banks.py::TestWorkflows::test_project_partial_page_scrape[ebrd-https://www.ebrd.com/work-with-us/projects/psd/52642.html]' './extract/tests/integration/banks.py::TestWorkflows::test_project_partial_page_scrape[ebrd-https://www.ebrd.com/work-with-us/projects/psd/54846.html]' './extract/tests/integration/banks.py::TestWorkflows::test_project_partial_page_scrape[ebrd-https://www.ebrd.com/work-with-us/projects/psd/technonicol-regional-expansion--resource-efficiency.html]' -vv"

After fix

The first of these failures was a Unicode decoding error. The others require a GEMINI_API_KEY, so I generated a free-tier key with my Google account and used that. Rerunning just the six previously failing tests now gives:

Command output
warning: The `tool.uv.dev-dependencies` field (used in `/app/pyproject.toml`) is deprecated and will be removed in a future release; use `dependency-groups.dev` instead
Installed 14 packages in 3.64s
============================= test session starts ==============================
platform linux -- Python 3.11.0rc1, pytest-8.4.1, pluggy-1.6.0 -- /app/.venv/bin/python3
cachedir: .pytest_cache
django: version: 5.2.5, settings: config.settings (from env)
rootdir: /app
configfile: pyproject.toml
plugins: anyio-4.10.0, django-4.11.1
collecting ... collected 6 items

extract/tests/integration/banks.py::TestWorkflows::test_partial_download[ebrd] PASSED [ 16%]
extract/tests/integration/banks.py::TestWorkflows::test_project_partial_page_scrape[ebrd-https://www.ebrd.com/home/work-with-us/projects/psd/36582.html] PASSED [ 33%]
extract/tests/integration/banks.py::TestWorkflows::test_project_partial_page_scrape[ebrd-https://www.ebrd.com/home/work-with-us/projects/psd/56092.html] PASSED [ 50%]
extract/tests/integration/banks.py::TestWorkflows::test_project_partial_page_scrape[ebrd-https://www.ebrd.com/work-with-us/projects/psd/52642.html] PASSED [ 66%]
extract/tests/integration/banks.py::TestWorkflows::test_project_partial_page_scrape[ebrd-https://www.ebrd.com/work-with-us/projects/psd/54846.html] PASSED [ 83%]
extract/tests/integration/banks.py::TestWorkflows::test_project_partial_page_scrape[ebrd-https://www.ebrd.com/work-with-us/projects/psd/technonicol-regional-expansion--resource-efficiency.html] PASSED [100%]

=============================== warnings summary ===============================
../../.venv/lib/python3.11/site-packages/pyppeteer/us_keyboard_layout.py:73
  /app/.venv/lib/python3.11/site-packages/pyppeteer/us_keyboard_layout.py:73: DeprecationWarning: invalid escape sequence '\('
    'Digit9': {'keyCode': 57, 'code': 'Digit9', 'shiftKey': '\(', 'key': '9'},

../../.venv/lib/python3.11/site-packages/pyppeteer/us_keyboard_layout.py:143
  /app/.venv/lib/python3.11/site-packages/pyppeteer/us_keyboard_layout.py:143: DeprecationWarning: invalid escape sequence '\<'
    'Comma': {'keyCode': 188, 'code': 'Comma', 'shiftKey': '\<', 'key': ','},

../../.venv/lib/python3.11/site-packages/pyppeteer/us_keyboard_layout.py:247
  /app/.venv/lib/python3.11/site-packages/pyppeteer/us_keyboard_layout.py:247: DeprecationWarning: invalid escape sequence '\<'
    '<': {'keyCode': 188, 'key': '\<', 'code': 'Comma'},

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
======================== 6 passed, 3 warnings in 38.90s ========================
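
The `ValueError` tracebacks earlier in this thread show `api_key = ''` reaching `genai.Client`, which only fails deep inside the library. As a rough sketch (not code from this PR), a small check before constructing the client would surface the missing key with a clearer message. The helper name `require_api_key` is hypothetical; `GEMINI_API_KEY` is the variable named in the comment above.

```python
import os


def require_api_key(env_var: str = "GEMINI_API_KEY") -> str:
    """Return the API key from the environment or fail with a clear message.

    Hypothetical helper for illustration; it is not part of the repository.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        # An empty string is treated the same as a missing variable,
        # matching the `api_key = ''` case seen in the tracebacks.
        raise RuntimeError(
            f"{env_var} is not set; add it to the env file passed to "
            "`docker run --env-file` before running the EBRD tests."
        )
    return key
```

A workflow could call `require_api_key()` and pass the result to the client constructor, or a test could catch the `RuntimeError` and skip instead of failing.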

@jpivarski

Ran all tests again

And everything passed. (This is with a Gemini API key set up, though it also works without one.)

Command output
warning: The `tool.uv.dev-dependencies` field (used in `/app/pyproject.toml`) is deprecated and will be removed in a future release; use `dependency-groups.dev` instead
Installed 14 packages in 3.61s
============================= test session starts ==============================
platform linux -- Python 3.11.0rc1, pytest-8.4.1, pluggy-1.6.0 -- /app/.venv/bin/python3
cachedir: .pytest_cache
django: version: 5.2.5, settings: config.settings (from env)
rootdir: /app
configfile: pyproject.toml
plugins: anyio-4.10.0, django-4.11.1
collecting ... collected 81 items

extract/tests/integration/banks.py::TestWorkflows::test_download[deg] PASSED [  1%]
extract/tests/integration/banks.py::TestWorkflows::test_download[dfc] PASSED [  2%]
extract/tests/integration/banks.py::TestWorkflows::test_download[kfw] PASSED [  3%]
extract/tests/integration/banks.py::TestWorkflows::test_download[wb] PASSED [  4%]
extract/tests/integration/banks.py::TestWorkflows::test_partial_download[afdb] PASSED [  6%]
extract/tests/integration/banks.py::TestWorkflows::test_partial_download[ebrd] PASSED [  7%]
extract/tests/integration/banks.py::TestWorkflows::test_partial_download[idb] PASSED [  8%]
extract/tests/integration/banks.py::TestWorkflows::test_generate_seed_urls[adb] PASSED [  9%]
extract/tests/integration/banks.py::TestWorkflows::test_generate_seed_urls[aiib] PASSED [ 11%]
extract/tests/integration/banks.py::TestWorkflows::test_generate_seed_urls[bio] PASSED [ 12%]
extract/tests/integration/banks.py::TestWorkflows::test_generate_seed_urls[eib] PASSED [ 13%]
extract/tests/integration/banks.py::TestWorkflows::test_generate_seed_urls[fmo] PASSED [ 14%]
extract/tests/integration/banks.py::TestWorkflows::test_generate_seed_urls[ifc] PASSED [ 16%]
extract/tests/integration/banks.py::TestWorkflows::test_generate_seed_urls[miga] PASSED [ 17%]
extract/tests/integration/banks.py::TestWorkflows::test_generate_seed_urls[nbim] PASSED [ 18%]
extract/tests/integration/banks.py::TestWorkflows::test_generate_seed_urls[pro] PASSED [ 19%]
extract/tests/integration/banks.py::TestWorkflows::test_generate_seed_urls[undp] PASSED [ 20%]
extract/tests/integration/banks.py::TestWorkflows::test_project_page_scrape[adb-https://www.adb.org/iati/iati-activities-af.xml] PASSED [ 22%]
extract/tests/integration/banks.py::TestWorkflows::test_project_page_scrape[adb-http://www.adb.org/iati/iati-activities-fj.xml] PASSED [ 23%]
extract/tests/integration/banks.py::TestWorkflows::test_project_page_scrape[adb-http://www.adb.org/iati/iati-activities-id.xml] PASSED [ 24%]
extract/tests/integration/banks.py::TestWorkflows::test_project_page_scrape[adb-http://www.adb.org/iati/iati-activities-in.xml] PASSED [ 25%]
extract/tests/integration/banks.py::TestWorkflows::test_project_page_scrape[adb-https://www.adb.org/iati/iati-activities-kh.xml] PASSED [ 27%]
extract/tests/integration/banks.py::TestWorkflows::test_project_page_scrape[adb-http://www.adb.org/iati/iati-activities-mn.xml] PASSED [ 28%]
extract/tests/integration/banks.py::TestWorkflows::test_project_page_scrape[adb-http://www.adb.org/iati/iati-activities-ph.xml] PASSED [ 29%]
extract/tests/integration/banks.py::TestWorkflows::test_project_page_scrape[adb-https://www.adb.org/iati/iati-activities-reg.xml] PASSED [ 30%]
extract/tests/integration/banks.py::TestWorkflows::test_project_page_scrape[adb-http://www.adb.org/iati/iati-activities-uz.xml] PASSED [ 32%]
extract/tests/integration/banks.py::TestWorkflows::test_project_page_scrape[adb-http://www.adb.org/iati/iati-activities-vn.xml] PASSED [ 33%]
extract/tests/integration/banks.py::TestWorkflows::test_project_page_scrape[aiib-https://www.aiib.org/en/projects/details/2023/approved/Egypt-Sustainable-Transport-and-Digital-Infrastructure-Guarantee.html] PASSED [ 34%]
extract/tests/integration/banks.py::TestWorkflows::test_project_page_scrape[aiib-https://www.aiib.org/en/projects/details/2016/approved/Tajikistan-Dushanbe-Uzbekistan-Border-Road-Improvement.html] PASSED [ 35%]
extract/tests/integration/banks.py::TestWorkflows::test_project_page_scrape[aiib-https://www.aiib.org/en/projects/details/2023/proposed/Viet-Nam-Gia-Lai-Wind-Power-Project.html] PASSED [ 37%]
extract/tests/integration/banks.py::TestWorkflows::test_project_page_scrape[fmo-https://www.fmo.nl/project-detail/45033] PASSED [ 38%]
extract/tests/integration/banks.py::TestWorkflows::test_project_page_scrape[fmo-https://www.fmo.nl/project-detail/60377] PASSED [ 39%]
extract/tests/integration/banks.py::TestWorkflows::test_project_page_scrape[fmo-https://www.fmo.nl/project-detail/62828] PASSED [ 40%]
extract/tests/integration/banks.py::TestWorkflows::test_project_page_scrape[ifc-https://disclosuresservice.ifc.org/api/searchprovider/searchenterpriseprojects?payload=*&$start=0$srt=disclosed_date$order=desc$rows=100] PASSED [ 41%]
extract/tests/integration/banks.py::TestWorkflows::test_project_page_scrape[ifc-https://disclosuresservice.ifc.org/api/searchprovider/searchenterpriseprojects?payload=*&$start=200$srt=disclosed_date$order=desc$rows=100] PASSED [ 43%]
extract/tests/integration/banks.py::TestWorkflows::test_project_page_scrape[ifc-https://disclosuresservice.ifc.org/api/searchprovider/searchenterpriseprojects?payload=*&$start=8300$srt=disclosed_date$order=desc$rows=100] PASSED [ 44%]
extract/tests/integration/banks.py::TestWorkflows::test_project_page_scrape[ifc-https://disclosuresservice.ifc.org/api/searchprovider/searchenterpriseprojects?payload=*&$start=11800$srt=disclosed_date$order=desc$rows=100] PASSED [ 45%]
extract/tests/integration/banks.py::TestWorkflows::test_project_page_scrape[ifc-https://disclosuresservice.ifc.org/api/searchprovider/searchenterpriseprojects?payload=*&$start=11900$srt=disclosed_date$order=desc$rows=100] PASSED [ 46%]
extract/tests/integration/banks.py::TestWorkflows::test_project_page_scrape[miga-https://www.miga.org/project/bboxx-rwanda-kenya-and-drc-0] PASSED [ 48%]
extract/tests/integration/banks.py::TestWorkflows::test_project_page_scrape[miga-https://www.miga.org/project/dedicated-freight-corridor-corporation-india-limited-1] PASSED [ 49%]
extract/tests/integration/banks.py::TestWorkflows::test_project_page_scrape[miga-https://www.miga.org/project/koridori-srbije-ltd-morava-motorway-0] PASSED [ 50%]
extract/tests/integration/banks.py::TestWorkflows::test_project_page_scrape[pro-https://www.proparco.fr/en/carte-des-projets/ecobank-trade-finance] PASSED [ 51%]
extract/tests/integration/banks.py::TestWorkflows::test_project_page_scrape[pro-https://www.proparco.fr/en/carte-des-projets/darby-latam-pf-iii] PASSED [ 53%]
extract/tests/integration/banks.py::TestWorkflows::test_project_page_scrape[pro-https://www.proparco.fr/en/carte-des-projets/mpef-iv] PASSED [ 54%]
extract/tests/integration/banks.py::TestWorkflows::test_project_partial_page_scrape[afdb-https://mapafrica.afdb.org/api/v13/activities/46002-P-MZ-AA0-045/organisations] PASSED [ 55%]
extract/tests/integration/banks.py::TestWorkflows::test_project_partial_page_scrape[bio-https://www.bio-invest.be/en/investments/banco-guayaquil-1] PASSED [ 56%]
extract/tests/integration/banks.py::TestWorkflows::test_project_partial_page_scrape[bio-https://www.bio-invest.be/en/investments/cofina-mali] PASSED [ 58%]
extract/tests/integration/banks.py::TestWorkflows::test_project_partial_page_scrape[bio-https://www.bio-invest.be/en/investments/zoscales-fund-i] PASSED [ 59%]
extract/tests/integration/banks.py::TestWorkflows::test_project_partial_page_scrape[ebrd-https://www.ebrd.com/home/work-with-us/projects/psd/36582.html] PASSED [ 60%]
extract/tests/integration/banks.py::TestWorkflows::test_project_partial_page_scrape[ebrd-https://www.ebrd.com/home/work-with-us/projects/psd/56092.html] PASSED [ 61%]
extract/tests/integration/banks.py::TestWorkflows::test_project_partial_page_scrape[ebrd-https://www.ebrd.com/work-with-us/projects/psd/52642.html] PASSED [ 62%]
extract/tests/integration/banks.py::TestWorkflows::test_project_partial_page_scrape[ebrd-https://www.ebrd.com/work-with-us/projects/psd/54846.html] PASSED [ 64%]
extract/tests/integration/banks.py::TestWorkflows::test_project_partial_page_scrape[ebrd-https://www.ebrd.com/work-with-us/projects/psd/technonicol-regional-expansion--resource-efficiency.html] PASSED [ 65%]
extract/tests/integration/banks.py::TestWorkflows::test_project_partial_page_scrape[eib-https://www.eib.org/en/projects/all/20190714] PASSED [ 66%]
extract/tests/integration/banks.py::TestWorkflows::test_project_partial_page_scrape[idb-https://www.iadb.org/en/project/BO0060] PASSED [ 67%]
extract/tests/integration/banks.py::TestWorkflows::test_project_partial_page_scrape[idb-https://www.iadb.org/en/project/CO-T1792] PASSED [ 69%]
extract/tests/integration/banks.py::TestWorkflows::test_project_partial_page_scrape[idb-https://www.iadb.org/en/project/RG-Q0153] PASSED [ 70%]
extract/tests/integration/banks.py::TestWorkflows::test_project_partial_page_scrape[undp-https://api.open.undp.org/api/projects/00061970.json] PASSED [ 71%]
extract/tests/integration/banks.py::TestWorkflows::test_project_partial_page_scrape[undp-https://api.open.undp.org/api/projects/00091070.json] PASSED [ 72%]
extract/tests/integration/banks.py::TestWorkflows::test_project_partial_page_scrape[undp-https://api.open.undp.org/api/projects/00107513.json] PASSED [ 74%]
extract/tests/integration/banks.py::TestWorkflows::test_result_page_multi_scrape[bio-https://www.bio-invest.be/en/investments/p1] PASSED [ 75%]
extract/tests/integration/banks.py::TestWorkflows::test_result_page_multi_scrape[bio-https://www.bio-invest.be/en/investments/p17] PASSED [ 76%]
extract/tests/integration/banks.py::TestWorkflows::test_result_page_multi_scrape[bio-https://www.bio-invest.be/en/investments/p31] PASSED [ 77%]
extract/tests/integration/banks.py::TestWorkflows::test_result_page_multi_scrape[eib-https://www.eib.org/page-provider/projects/list?pageNumber=0&itemPerPage=500&pageable=true&sortColumn=id] PASSED [ 79%]
extract/tests/integration/banks.py::TestWorkflows::test_result_page_multi_scrape[eib-https://www.eib.org/page-provider/projects/list?pageNumber=17&itemPerPage=500&pageable=true&sortColumn=id] PASSED [ 80%]
extract/tests/integration/banks.py::TestWorkflows::test_result_page_multi_scrape[eib-https://www.eib.org/page-provider/projects/list?pageNumber=32&itemPerPage=500&pageable=true&sortColumn=id] PASSED [ 81%]
extract/tests/integration/banks.py::TestWorkflows::test_result_page_multi_scrape[undp-http://open.undp.org/download/iati_xml/Belarus_Republic_of_projects.xml] PASSED [ 82%]
extract/tests/integration/banks.py::TestWorkflows::test_result_page_multi_scrape[undp-http://open.undp.org/download/iati_xml/Gambia_projects.xml] PASSED [ 83%]
extract/tests/integration/banks.py::TestWorkflows::test_result_page_multi_scrape[undp-https://open.undp.org/download/iati_xml/Niue_projects.xml] PASSED [ 85%]
extract/tests/integration/banks.py::TestWorkflows::test_result_page_scrape[fmo-https://www.fmo.nl/worldmap?page=1] PASSED [ 86%]
extract/tests/integration/banks.py::TestWorkflows::test_result_page_scrape[fmo-https://www.fmo.nl/worldmap?page=30] PASSED [ 87%]
extract/tests/integration/banks.py::TestWorkflows::test_result_page_scrape[fmo-https://www.fmo.nl/worldmap?page=55] PASSED [ 88%]
extract/tests/integration/banks.py::TestWorkflows::test_result_page_scrape[idb-https://www.iadb.org/en/project-search?page=0] PASSED [ 90%]
extract/tests/integration/banks.py::TestWorkflows::test_result_page_scrape[idb-https://www.iadb.org/en/project-search?page=400] PASSED [ 91%]
extract/tests/integration/banks.py::TestWorkflows::test_result_page_scrape[idb-https://www.iadb.org/en/project-search?page=997] PASSED [ 92%]
extract/tests/integration/banks.py::TestWorkflows::test_result_page_scrape[miga-https://www.miga.org/projects?page=0] PASSED [ 93%]
extract/tests/integration/banks.py::TestWorkflows::test_result_page_scrape[miga-https://www.miga.org/projects?page=68] PASSED [ 95%]
extract/tests/integration/banks.py::TestWorkflows::test_result_page_scrape[miga-https://www.miga.org/projects?page=118] PASSED [ 96%]
extract/tests/integration/banks.py::TestWorkflows::test_result_page_scrape[pro-https://www.proparco.fr/en/projects/list?page=0] PASSED [ 97%]
extract/tests/integration/banks.py::TestWorkflows::test_result_page_scrape[pro-https://www.proparco.fr/en/projects/list?page=25] PASSED [ 98%]
extract/tests/integration/banks.py::TestWorkflows::test_result_page_scrape[pro-https://www.proparco.fr/en/projects/list?page=39] PASSED [100%]

=============================== warnings summary ===============================
../../.venv/lib/python3.11/site-packages/pyppeteer/us_keyboard_layout.py:73
  /app/.venv/lib/python3.11/site-packages/pyppeteer/us_keyboard_layout.py:73: DeprecationWarning: invalid escape sequence '\('
    'Digit9': {'keyCode': 57, 'code': 'Digit9', 'shiftKey': '\(', 'key': '9'},

../../.venv/lib/python3.11/site-packages/pyppeteer/us_keyboard_layout.py:143
  /app/.venv/lib/python3.11/site-packages/pyppeteer/us_keyboard_layout.py:143: DeprecationWarning: invalid escape sequence '\<'
    'Comma': {'keyCode': 188, 'code': 'Comma', 'shiftKey': '\<', 'key': ','},

../../.venv/lib/python3.11/site-packages/pyppeteer/us_keyboard_layout.py:247
  /app/.venv/lib/python3.11/site-packages/pyppeteer/us_keyboard_layout.py:247: DeprecationWarning: invalid escape sequence '\<'
    '<': {'keyCode': 188, 'key': '\<', 'code': 'Comma'},

src/pipeline/extract/tests/integration/banks.py::TestWorkflows::test_download[deg]
  /app/src/pipeline/extract/workflows/banks/deg.py:237: FutureWarning: Downcasting behavior in `replace` is deprecated and will be removed in a future version. To retain the old behavior, explicitly call `result.infer_objects(copy=False)`. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`
    df[cols] = df[cols].replace({None: ""})

src/pipeline/extract/tests/integration/banks.py::TestWorkflows::test_download[kfw]
  /app/src/pipeline/extract/workflows/banks/kfw.py:182: FutureWarning: Downcasting behavior in `replace` is deprecated and will be removed in a future version. To retain the old behavior, explicitly call `result.infer_objects(copy=False)`. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`
    df[cols] = df[cols].replace({None: ""})

src/pipeline/extract/tests/integration/banks.py::TestWorkflows::test_download[wb]
  /app/src/pipeline/extract/workflows/banks/wb.py:141: SettingWithCopyWarning: 
  A value is trying to be set on a copy of a slice from a DataFrame.
  Try using .loc[row_indexer,col_indexer] = value instead
  
  See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
    financers_df["Financers"] = financers_df["Name"].str.upper()

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
================== 81 passed, 6 warnings in 446.48s (0:07:26) ==================

@jpivarski jpivarski requested review from LaunaG and removed request for trevorspreadbury March 4, 2026 21:39