Reflections on the FCA synthetic data call for input

The FCA’s 2022 call for input on synthetic data [1] gives academics, startups, RegTechs, FinTechs and technology firms an opportunity to share their views on the adoption of synthetic data in financial services and the opportunities it creates. Put simply, synthetic data is data generated by a computer (or a human) as an alternative to data collected by measuring or observing a process (such as the financial system).

This call for input forms a crucial bridge between the regulator and the financial services industry and encourages further innovation. FinCrime Dynamics will also submit its thoughts, to share with the community our thinking on synthetic data and our approach to using it to enhance financial crime analytics.

Over the last three years, the Alan Turing Institute has shown substantial interest in synthetic data as a way to improve AI in financial services, most notably by launching the research project ‘Synthetic data generation for finance and economics’ with key industry partners [2]. UK financial regulators such as the FCA have clearly understood this need, as shown by their initiatives based on synthetic data: the TechSprint in 2019 [3], the DataSprint in 2020 and the digital sandbox pilot in 2021 [4]. The contributions of FinCrime Dynamics to these events, and its interaction with the community at them, were the genesis of the financial crime vaccine concept.

DATA IS THE FUEL OF AI

There is no doubt that data is the fundamental fuel driving innovation in the artificial intelligence era. Alternative data such as synthetic data has gained popularity because of its benefits in domains where confidentiality is a primary concern, such as financial services. Expectations are high that synthetic data will become a disruptive technology within financial services in the near future. By democratising access to otherwise confidential data, it will allow third parties to develop cost-effective solutions that can significantly improve the efficiency of financial services. Gartner predicts that 60% of AI will be trained using synthetic data by 2024 [5]. The FCA has also included the exploration of synthetic datasets to test financial crime controls in its business plan for 2022/23 [6].

The concept of using synthetic data for money laundering detection has been around for over 10 years [7]. Synthetic data has already enabled solutions that unlock data sharing and pave the way for innovation in financial services using artificial intelligence (AI).

Artificial intelligence has been a transformative force in financial services and plays a key role in innovation. The World Economic Forum, in its report ‘The New Physics of Financial Services’, states that innovation has many stages, from doing the same things better or faster to doing something radically different that creates new value propositions [8]. Innovation in AI is driven by data, and lack of access to high-quality data is perhaps the most discouraging barrier. When GDPR came into force in 2018, it was foreseen that it would have an impact on innovation [9]. It is therefore unsurprising that startups have emerged that allow the industry to generate synthetic data.

PUBLIC SYNTHETIC DATASETS

In 2017, Dr Edgar Lopez-Rojas shared two synthetic datasets with Kaggle’s scientific community for research into financial crime analytics: the PaySim dataset and the BankSim dataset. These datasets contain transactional and demographic information about customers and their behaviours, enriched with labels of fraudulent behaviour. Labelled data of this kind can greatly improve the ability of firms to train machine learning models for the detection of financial crime.

The PaySim dataset (https://www.kaggle.com/datasets/ealaxi/paysim1) simulates mobile money transactions based on a sample of real transactions extracted from one month of financial logs from a mobile money service implemented in an African country. It is one of the outcomes of Dr Lopez-Rojas’s research on the synthetic generation of financial transactions [10]. The dataset has been downloaded more than 52,000 times by practitioners in industry, financial services and academia. It has also been cited by the MIT-IBM Watson AI Lab in its research on scalable graph learning for AML [11], and it has been hosted in the Digital Sandbox run by the FCA and the City of London.

The second synthetic dataset comes from the BankSim project [12]. BankSim is an agent-based simulator of bank payments based on a sample of aggregated transactional data provided by a bank in Spain. Like PaySim, this dataset has been downloaded more than 18,000 times and cited by relevant academic and industrial researchers.
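
To make the idea of an agent-based payments simulator more concrete, here is a minimal Python sketch. It is not the BankSim or PaySim implementation. The function name `simulate`, the `Transaction` fields, the activity probability, the fraud rate and the rule that fraudulent payments are roughly ten times larger than normal ones are all assumptions chosen purely for illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class Transaction:
    step: int          # simulation time step (e.g. a day)
    customer: str      # id of the customer agent
    merchant: str      # id of the receiving merchant
    amount: float      # payment amount
    is_fraud: bool     # ground-truth label, known because we generated it


def simulate(n_customers=50, n_steps=30, fraud_rate=0.02, seed=42):
    """Generate labelled synthetic payments (illustrative assumptions only)."""
    rng = random.Random(seed)  # seeded so the dataset is reproducible
    merchants = [f"M{i}" for i in range(10)]
    # Each customer agent has its own typical spend per payment.
    typical_spend = {f"C{i}": rng.uniform(10, 200) for i in range(n_customers)}

    transactions = []
    for step in range(n_steps):
        for customer, mean_amount in typical_spend.items():
            if rng.random() < 0.5:
                continue  # this agent makes no payment in this step
            is_fraud = rng.random() < fraud_rate
            # Hypothetical fraud behaviour: amounts about 10x the usual mean.
            scale = 10.0 if is_fraud else 1.0
            amount = round(rng.expovariate(1.0 / (mean_amount * scale)), 2)
            transactions.append(
                Transaction(step, customer, rng.choice(merchants), amount, is_fraud)
            )
    return transactions


if __name__ == "__main__":
    data = simulate()
    n_fraud = sum(t.is_fraud for t in data)
    print(f"{len(data)} transactions generated, {n_fraud} labelled as fraud")
```

Because every record is generated rather than observed, the fraud label is known with certainty and no real customer is exposed, which is what makes datasets of this kind shareable. A realistic simulator would calibrate the agents' behaviour against aggregated real data, as the description of BankSim above indicates.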

CALL FOR INPUT ON SYNTHETIC DATA

The FCA has previously used synthetic data, during the 2019 ‘Global AML and Financial Crime TechSprint’. Over 140 participants focused on how privacy enhancing technologies (PETs) can facilitate information sharing to fight money laundering and financial crime.

This call for input on synthetic data has 16 questions (see annex) that aim to collect important information about the value proposition of synthetic data, specific use cases within financial services and the potential for collaboration. The deadline for sending comments is 22 June 2022. Every point of view from members of the community is relevant to building a unified front around the value of synthetic data within financial services.

Blog by Dr Edgar Lopez Rojas (CTO of FinCrime Dynamics)

SOURCES

[1] FCA, Call for input: Synthetic data to support financial services innovation (2022). https://www.fca.org.uk/publications/calls-input/call-input-synthetic-data-support-financial-services-innovation 

[2] Alan Turing Institute. Synthetic data generation for finance and economics (accessed in May 2022). https://www.turing.ac.uk/research/research-projects/synthetic-data-generation-finance-and-economics

[3] FCA, 2019 Global AML and Financial Crime TechSprint (accessed in May 2022). https://www.fca.org.uk/events/techsprints/2019-global-aml-and-financial-crime-techsprint

[4] FCA, Digital sandbox pilot: FCA DataSprint (accessed in May 2022). https://www.fca.org.uk/firms/innovation/digital-sandbox-pilot-datasprint 

[5] Gartner, 2021. Predicts 2021: Data and Analytics Strategies to Govern, Scale and Transform Digital Business. https://www.gartner.com/en/documents/3993855

[6] FCA, 2022. FCA business plan 2022/23. https://www.fca.org.uk/publication/corporate/business-plan-2022-23.pdf 

[7] Lopez-Rojas, E.A. and Axelsson, S., 2012. Money laundering detection using synthetic data. In Annual workshop of the Swedish Artificial Intelligence Society (SAIS). Linköping University Electronic Press, Linköpings universitet.

[8] World Economic Forum. The New Physics of Financial Services (2018). https://www3.weforum.org/docs/WEF_New_Physics_of_Financial_Services.pdf 

[9] Lopez Rojas, E.A., Gultemen, D. and Zoto, E., 2018. On the GDPR introduction in EU and its impact on financial fraud research. In European Modeling and Simulation Symposium, EMSS 2018. Cal-tek Srl.

[10] Lopez-Rojas, E.A., Elmir, A. and Axelsson, S., 2016. PaySim: A financial mobile money simulator for fraud detection. In 28th European Modeling and Simulation Symposium (EMSS), Larnaca, Cyprus.

[11] Weber, M., Chen, J., Suzumura, T., Pareja, A., Ma, T., Kanezashi, H., Kaler, T., Leiserson, C.E. and Schardl, T.B., 2018. Scalable graph learning for anti-money laundering: A first look. arXiv preprint arXiv:1812.00076.

[12] Lopez-Rojas, E.A. and Axelsson, S., 2014. BankSim: A bank payments simulator for fraud detection research. In 26th European Modeling and Simulation Symposium, EMSS 2014, Bordeaux, France, pp. 144–152. Dime University of Genoa. ISBN: 9788897999324.
