Chapter 4: Testing for Trust
Evaluating Vendor Accuracy and Reliability
In this chapter, we’ll discuss how to assess the accuracy and reliability of a screening system vendor. Factors to consider include the vendor’s data quality, update frequency, and algorithm effectiveness. This structured approach allows you to methodically assess each critical feature of the system, from its data handling capabilities to its integration with your existing infrastructure and user interface. Each factor provides insight into how the system performs in real-world scenarios and whether it meets your specific operational and compliance needs.
Testing Environment
Your vendor should provide a testing environment that allows you to experiment with the system and understand its ability to meet your specific operational needs. They should provide API access to test system capabilities, either through immediate access to the production environment during the testing phase of an evaluation, or through an API sandbox that mirrors the production environment.
Live API Access
Live access is ideal for real-time testing: your compliance system vendor should provide access to the live production environment for compliance screening tests during the evaluation phase of your procurement process.
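As a concrete starting point, a first smoke test against a vendor’s screening API might look like the sketch below. The endpoint URL, authentication scheme, request parameters, and response fields are illustrative placeholders, not any particular vendor’s API; substitute the values from your vendor’s documentation. Later sketches in this chapter reuse this screen() helper.

import requests

# Hypothetical endpoint, credentials, and parameters; substitute the
# values from your vendor's API documentation.
API_URL = "https://api.example-vendor.com/v1/screen"
API_KEY = "your-api-key"

def screen(name, **params):
    """Submit one screening query and return the parsed JSON response."""
    response = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"query": name, **params},
        timeout=10,
    )
    response.raise_for_status()
    return response.json()

# Smoke test: a name taken directly from a public sanctions list
# should come back with at least one alert.
print(screen("Example Sanctioned Name", min_score=85))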
Also, ensure the vendor provides:
Comprehensive documentation for ease of testing. At Castellum.AI, our process and technical documents are publicly available.
Active support during the testing phase for any technical assistance.
Feedback mechanisms for continuous improvement post-testing.
These elements are essential for thoroughly and effectively evaluating a screening system's capabilities.
API Sandbox Testing
An API sandbox should reflect the same performance metrics as your vendor’s production API. Access to a sandbox environment helps you test the screening system’s functionality and conduct initial test queries. Once you’ve finalized integration with a screening vendor, continued access to a sandbox environment enables ongoing performance monitoring, regular screening tests, and testing changes to your internal screening process before implementing them in the production environment.
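One simple parity check is to run identical queries against both environments and compare the alerts returned. The sketch below assumes the vendor exposes parallel sandbox and production endpoints with the same response shape; the URLs and fields are placeholders, and authentication is omitted for brevity.

import requests

# Hypothetical parallel environments; substitute your vendor's URLs
# and add authentication as their documentation requires.
ENVIRONMENTS = {
    "sandbox": "https://sandbox.example-vendor.com/v1/screen",
    "production": "https://api.example-vendor.com/v1/screen",
}
TEST_QUERIES = ["Example Name One", "Example Name Two"]

def alert_ids(url, query):
    """Return the set of watchlist entry IDs flagged for a query."""
    response = requests.post(url, json={"query": query}, timeout=10)
    response.raise_for_status()
    return {hit["id"] for hit in response.json().get("hits", [])}

for query in TEST_QUERIES:
    results = {env: alert_ids(url, query) for env, url in ENVIRONMENTS.items()}
    if results["sandbox"] != results["production"]:
        print(f"Mismatch for {query!r}: {results}")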
Sample Data Testing
Start by creating diverse test cases, including test screening entries from various risk categories and jurisdictions. Use data that represents typical scenarios your organization might encounter.
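As a sketch, a labeled test set might look like the snippet below; the fields and expected outcomes are illustrative, not a required schema. Later sketches in this chapter reuse this TEST_CASES structure.

# Illustrative labeled test set: each case pairs a screening input with
# the outcome you expect, so results can be checked automatically.
TEST_CASES = [
    {"name": "Example Sanctioned Entity", "jurisdiction": "IR",
     "risk_category": "sanctions", "expect_alert": True},
    {"name": "Example Politically Exposed Person", "jurisdiction": "RU",
     "risk_category": "pep", "expect_alert": True},
    {"name": "Example Ordinary Customer", "jurisdiction": "US",
     "risk_category": "none", "expect_alert": False},
]

With representative test data in hand, consider the following: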
Coverage and Completeness
Assess whether the system covers all risk categories and jurisdictions relevant to your operations. Ensure it includes comprehensive data from global sanctions lists, PEPs, export controls, and adverse media sources relevant to your risk profile. Vendors should offer high-quality, well-structured, and enriched data that is error-free and categorized for easy search and integration.
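A quick way to verify list coverage is to diff the sources the vendor reports against the sources your risk profile requires. The sketch below assumes a hypothetical endpoint that enumerates covered watchlists; check your vendor’s documentation for the real equivalent.

import requests

# Hypothetical endpoint enumerating the watchlist sources the vendor
# covers; your vendor's actual API will differ.
SOURCES_URL = "https://api.example-vendor.com/v1/sources"

# Sources your risk profile requires, by list identifier.
REQUIRED_SOURCES = {"OFAC SDN", "EU Consolidated", "UN Consolidated", "UK HMT"}

covered = {source["name"] for source in requests.get(SOURCES_URL, timeout=10).json()}
missing = REQUIRED_SOURCES - covered
print("Missing sources:", missing or "none")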
Questions for vendors
How do you ensure the accuracy and quality of your global risk data?
Can you describe your data enrichment process?
How do you handle data errors from watchlist issuing authorities?
Assessing a System’s Matching Algorithm and Alert Accuracy
Measure the system’s ability to correctly identify and flag entities based on their risk level. Use flexible query settings to test how the system flags known high-risk entities while avoiding false positives on low-risk or compliant entities. This process should include evaluating the system’s ability to distinguish screening inputs whose names or other identifying information merely resemble those of watchlisted parties.
At Castellum.AI, our matching algorithm, Jgram, is designed to provide the lowest rate of false positives while avoiding any false negatives. Unlike traditional matching algorithms based on Soundex, Levenshtein, or other models, Jgram analyzes text at a more nuanced level by tokenizing inputs into sequences of characters rather than trying to match against the entire query input. This approach addresses common sources of false positives in legacy screening systems and also makes Jgram effective across 100+ languages.
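To make the general idea concrete, the toy sketch below shows character n-gram matching: breaking strings into overlapping character sequences lets near-matches score highly even when whole-string comparison would fail. This is a simplified illustration of the technique in general, not Castellum.AI’s Jgram implementation.

# Toy illustration of character n-gram matching in general; this is
# not Castellum.AI's Jgram implementation.
def trigrams(text):
    """Tokenize a string into overlapping three-character sequences."""
    padded = f"  {text.lower()}  "  # pad so word edges produce trigrams
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

def similarity(a, b):
    """Jaccard overlap of the two strings' trigram sets."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)

# A transliteration variant still scores high on n-gram overlap, even
# though it would fail an exact whole-string comparison.
print(similarity("Mohammed Al-Rashid", "Muhammad Al Rashid"))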
False Positive Analysis
This crucial step involves rigorously testing the system's ability to differentiate between true compliance risks and false positives. Conduct a series of tests using diverse data samples to evaluate the system's accuracy in flagging risk.
Strategy: Focus on the system’s ability to correctly identify actual risks while minimizing false alerts. Develop test screening datasets and establish your screening parameters for tests, such as a minimum score threshold, the inclusion or exclusion of weak AKAs, and specific list sources. Holding these constant enables you to compare multiple vendors; a worked sketch follows the list below. Sample datasets should include:
Exact matches: Screen inputs that exactly match entries on public watchlists, such as sanctions lists, to confirm the screening system’s ability to generate true positive alerts.
Partial matches: Screen inputs that closely resemble, but do not exactly match, entries on public watchlists to confirm the screening system’s ability to generate true positive alerts on partial matches.
Real data: Screen real-world data, including entities that are similar, but not identical, to those on watchlists, to see how the system responds. This also gives you a comparison baseline against your existing screening system.
Data review: Carefully review instances where the system generates alerts. Assess whether these alerts are valid based on your specific criteria and operational context, so your team focuses on genuine risks without expending resources investigating non-issues.
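Putting the pieces together, the sketch below scores a system against the labeled test set defined earlier, reusing the hypothetical screen() helper and TEST_CASES structure. Running the same data and parameters through each candidate system yields directly comparable counts.

def evaluate(test_cases):
    """Count alert outcomes for a labeled test set."""
    counts = {"true_pos": 0, "false_pos": 0, "false_neg": 0, "true_neg": 0}
    for case in test_cases:
        alerted = bool(screen(case["name"], min_score=85).get("hits"))
        if case["expect_alert"] and alerted:
            counts["true_pos"] += 1
        elif case["expect_alert"] and not alerted:
            counts["false_neg"] += 1  # a miss: the most serious failure
        elif alerted:
            counts["false_pos"] += 1  # noise analysts must clear manually
        else:
            counts["true_neg"] += 1
    return counts

print(evaluate(TEST_CASES))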
By effectively analyzing the system's performance in this area, you gain valuable insights into its reliability, ensuring it aligns with your organization's risk management and compliance requirements.
Data Consistency
Check for consistency in how the system categorizes and displays information across different test cases, ensuring uniformity in data presentation.
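One way to automate this check, sketched below using the hypothetical screen() helper and an assumed response schema, is to verify that every hit exposes the same set of fields across all test queries.

# Assumed response schema for illustration; adjust to your vendor's.
EXPECTED_FIELDS = {"id", "name", "source", "category", "score"}

for case in TEST_CASES:
    for hit in screen(case["name"]).get("hits", []):
        unexpected = set(hit) - EXPECTED_FIELDS
        absent = EXPECTED_FIELDS - set(hit)
        if unexpected or absent:
            print(f"Inconsistent hit for {case['name']!r}: "
                  f"unexpected={unexpected}, missing={absent}")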
Response Time Evaluation
Measure how long the system takes to respond to a query by tracking the time from query submission to the system’s alert or response.
Compare these times with industry standards or your organization’s requirements. Response times are critical in fast-paced environments like financial services. At Castellum.AI, our API’s average watchlist response time is under 40ms.
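To gather your own numbers, the sketch below times a batch of queries client-side with the hypothetical screen() helper and reports both the average and the 95th percentile, since tail latency often matters more than the mean. Note that client-side timing includes network overhead on top of the vendor’s processing time.

import statistics
import time

# Time a batch of queries end to end; reuses the hypothetical
# screen() helper. Client-side timing includes network latency.
latencies_ms = []
for case in TEST_CASES:
    start = time.perf_counter()
    screen(case["name"])
    latencies_ms.append((time.perf_counter() - start) * 1000)

print(f"average: {statistics.mean(latencies_ms):.1f} ms")
# quantiles(n=20)[18] is the 95th percentile.
print(f"p95: {statistics.quantiles(latencies_ms, n=20)[18]:.1f} ms")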
Customization and Scalability Checks
Test the system’s ability to adapt to your specific needs. Consider its flexibility in adjusting risk thresholds and using filters to suit your operational requirements. Additionally, evaluate scalability by incrementally increasing data loads or user numbers, checking the system’s performance under growing demands. This assessment ensures the system can evolve with your organization's needs.
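A minimal load ramp, sketched below with the hypothetical screen() helper, increases concurrency step by step and watches whether throughput keeps pace. Coordinate any load test with your vendor before running it.

import concurrent.futures
import time

# Ramp concurrency step by step and measure sustained throughput;
# reuses the hypothetical screen() helper. Agree on load-test limits
# with your vendor before running anything like this.
for workers in (1, 5, 10, 25):
    queries = [case["name"] for case in TEST_CASES] * workers
    start = time.perf_counter()
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(screen, queries))
    rate = len(queries) / (time.perf_counter() - start)
    print(f"{workers:>3} workers: {rate:.1f} queries/sec")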
This is a lot to remember, but we’ve got you covered. Download our screening system buyer’s checklist to keep track of it all when you’re assessing vendors.