Nuit Blanche

Saturday, September 27, 2025

A Paradigm Shift: Reasoning at Enterprise Scale

When performing retrieval at scale on large sets of enterprise documents, it becomes very clear that current Retrieval Augmented Generation (RAG)-like approaches are not well suited, no matter how large context windows become. The "RAG is dead" meme that resurfaces every so often willfully ignores that:

most interesting sets of documents are always beyond the latest largest context window that the cool kids talk about
the reason we want a satisfying RAG is that we do not want to choose the documents that will come into the context window
the current story is about text, get ready for images, voice and videos
large context windows do not assure a level of recall quality

If company documents are the context needed to have a purposeful discussion with LLMs inside a company or if new services or products are built on internal documents, then we need to have new algorithms for an enriched experience with all the company knowledge.

At LightOn, we believe the future of AI retrieval lies in reasoning, not just pattern matching. As Antoine Chaffin explained in his Maven podcast appearance, single-vector embeddings collapse nuance into a single vector, limiting systems to shallow similarity. (Before you read the rest of the blog post, do not hesitate to get in touch if you want to help build this new stack.)

Late-interaction models take a different approach:

Every token is preserved as its own vector.
Matching happens late, at the interaction stage.
The result: deeper semantic understanding and genuine reasoning.
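To make the "matching happens late" idea concrete, here is a minimal sketch of the MaxSim scoring used by ColBERT-style models, next to a mean-pooled single-vector baseline. The token embeddings are hypothetical toy values in 3 dimensions, and the function names are illustrative; this is not PyLate's API.

```python
import numpy as np

def normalize(x):
    """L2-normalize each row so dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def maxsim_score(query_tokens, doc_tokens):
    """Late interaction: each query token keeps its best-matching document
    token, and the per-token maxima are summed."""
    sim = normalize(query_tokens) @ normalize(doc_tokens).T  # (n_query, n_doc)
    return float(sim.max(axis=1).sum())

def single_vector_score(query_tokens, doc_tokens):
    """Baseline: mean-pool each side into one vector, then take the cosine."""
    q = normalize(query_tokens.mean(axis=0, keepdims=True))
    d = normalize(doc_tokens.mean(axis=0, keepdims=True))
    return float((q @ d.T)[0, 0])

# Hypothetical toy embeddings: two query tokens, two candidate documents.
query = np.array([[1.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0]])
doc_a = np.array([[1.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0],
                  [0.0, 0.0, 1.0]])  # holds an exact match for each query token
doc_b = np.array([[1.0, 1.0, 0.0]])  # one token that blends both directions

for name, doc in [("doc_a", doc_a), ("doc_b", doc_b)]:
    print(name,
          "maxsim:", round(maxsim_score(query, doc), 3),
          "pooled:", round(single_vector_score(query, doc), 3))
```

On this toy data the two scorers rank the documents in opposite order: pooling favors the blended token, while MaxSim rewards the document that matches each query token individually.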

This simple but powerful insight has sparked an open-source ecosystem that’s now shaping both academic research and production-scale AI systems.

PyLate: From Experimental Code to Peer-Reviewed Paper

PyLate began as an internal experiment to simplify multi-vector training. Today, it’s a full-fledged library with 527 GitHub stars and growing adoption.

Academic recognition: PyLate’s paper was accepted at CIKM 2025 (see below), becoming the first peer-reviewed library dedicated to training ColBERT-style models.
Practical impact: Researchers can train a state-of-the-art retrieval model on MS MARCO in under 2 hours with just ~80 lines of code.
Real-world benefit: Out-of-domain search, reasoning-heavy tasks, and long-context retrieval become accessible to any team.

If you want to learn more about the library: PyLate documentation

ModernBERT: Re-Imagining the Encoder

In partnership with Answer.AI, LightOn co-developed ModernBERT, a model that fundamentally rethinks encoder architecture.

8192-token context with Flash Attention, running efficiently on consumer GPUs.
1,500 GitHub stars and 27M+ downloads on HuggingFace.
Poster presentation at ACL 2025 (Vienna): validation from one of NLP’s most competitive venues.

ModernBERT has already been cited 305+ times, with variants like BioClinical ModernBERT emerging for healthcare applications.

👉 Explore: ModernBERT LightOn blog post

FastPlaid: Performance That Scales

Building great models is only half the challenge; making them work in production is the other. That’s where FastPlaid comes in.

A Rust + CUDA engine for multi-vector search.
Delivers +554% throughput improvements for multi-vector search compared to Stanford’s PLAID baseline.
Designed for scalability: powering recommendation engines, retrieval-augmented generation (RAG), and real-time search.

As Raphael Sourty explains, static indexes solve many use cases, but mutable indexes (new in v1.10.0) unlock real-world applications where data evolves continuously.

👉 Read more: FastPlaid LightOn blogpost

PyLate-rs: Retrieval in the Browser

Finally, to push accessibility even further, PyLate-rs compiles late-interaction inference to WebAssembly (WASM).

That means:

Run a state-of-the-art retriever directly in the browser.
Achieve 97% faster cold-start performance on CPU.
Remove server dependencies entirely.

This lowers the barrier for demos, education, and lightweight deployments, proving late-interaction isn’t just powerful, it’s portable.

From Theory to Production: A Movement

Taken together, these projects form a technical symphony:

ModernBERT provides the backbone.
PyLate enables fast and easy training of SOTA models.
FastPlaid ensures scalable search performance.
PyLate-rs brings inference to any environment.

The ecosystem has grown from an academic curiosity into a reasoning-first retrieval stack. With recognition at CIKM and ACL, adoption across GitHub and HuggingFace, and practical tools for real-world workflows, LightOn is helping shape the next era of AI search.

📖 Explore LightOn’s open-source ecosystem:

PyLate
ModernBERT
FastPlaid
PyLate has already enabled the development of state-of-the-art models, such as:

Other models:

Dataset

FC-AMF-OCR Dataset: a 9.3-million-image OCR dataset to improve real-world document parsing

Pre-training libraries

PyLate: Flexible Training and Retrieval for Late Interaction Models by Antoine Chaffin, Raphaël Sourty

Neural ranking has become a cornerstone of modern information retrieval. While single vector search remains the dominant paradigm, it suffers from the shortcoming of compressing all the information into a single vector. This compression leads to notable performance degradation in out-of-domain, long-context, and reasoning-intensive retrieval tasks. Multi-vector approaches pioneered by ColBERT aim to address these limitations by preserving individual token embeddings and computing similarity via the MaxSim operator. This architecture has demonstrated superior empirical advantages, including enhanced out-of-domain generalization, long-context handling, and performance in complex retrieval scenarios. Despite these compelling empirical results and clear theoretical advantages, the practical adoption and public availability of late interaction models remain low compared to their single-vector counterparts, primarily due to a lack of accessible and modular tools for training and experimenting with such models. To bridge this gap, we introduce PyLate, a streamlined library built on top of Sentence Transformers to support multi-vector architectures natively, inheriting its efficient training, advanced logging, and automated model card generation while requiring minimal code changes to code templates users are already familiar with. By offering multi-vector-specific features such as efficient indexes, PyLate aims to accelerate research and real-world application of late interaction models, thereby unlocking their full potential in modern IR systems. Finally, PyLate has already enabled the development of state-of-the-art models, including GTE-ModernColBERT and Reason-ModernColBERT, demonstrating its practical utility for both research and production environments.

🌐 Learn more about lighton.ai

** Nuit Blanche is now on Twitter: @NuitBlog **
Follow @NuitBlog or join the CompressiveSensing Reddit, the Facebook page, the Compressive Sensing group on LinkedIn or the Advanced Matrix Factorization group on LinkedIn

Liked this entry ? subscribe to Nuit Blanche's feed, there's more where that came from. You can also subscribe to Nuit Blanche by Email.

Other links:
Paris Machine Learning: Meetup.com||@Archives||LinkedIn||Facebook|| @ParisMLGroup About LightOn: Newsletter ||@LightOnIO|| on LinkedIn || on CrunchBase || our Blog
About myself: LightOn || Google Scholar || LinkedIn ||@IgorCarron ||Homepage||ArXiv

Sunday, December 22, 2024

ModernBERT: Smarter, Better, Faster and with Longer context

🎄 Just in time for the magical week 🎅: LightOn and Answer.AI just made available a new model called ModernBERT.

ModernBERT is available as a slot-in replacement for any BERT-like model, in both a base (149M params) and a large (395M params) size.

To get a sense of how important the BERT model and its derivatives are, here are some figures:

Out of the 1.2 million different models uploaded on HuggingFace since its inception, Google's initial BERT model is the second most downloaded, with more than 65 million downloads last month.
Among the 30 most downloaded models, BERT and related models account for 325 million downloads last month.

We hope the community likes ModernBERT and builds applications that will be smarter 🧠 , better 🛰️ , faster 🚀 and with longer context 🦒 .

Here is the preprint:

Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference by Benjamin Warner, Antoine Chaffin, Benjamin Clavié, Orion Weller, Oskar Hallström, Said Taghadouini, Alexis Gallagher, Raja Biswas, Faisal Ladhak, Tom Aarsen, Nathan Cooper, Griffin Adams, Jeremy Howard, Iacopo Poli

Encoder-only transformer models such as BERT offer a great performance-size tradeoff for retrieval and classification tasks with respect to larger decoder-only models. Despite being the workhorse of numerous production pipelines, there have been limited Pareto improvements to BERT since its release. In this paper, we introduce ModernBERT, bringing modern model optimizations to encoder-only models and representing a major Pareto improvement over older encoders. Trained on 2 trillion tokens with a native 8192 sequence length, ModernBERT models exhibit state-of-the-art results on a large pool of evaluations encompassing diverse classification tasks and both single and multi-vector retrieval on different domains (including code). In addition to strong downstream performance, ModernBERT is also the most speed and memory efficient encoder and is designed for inference on common GPUs.

Thursday, August 17, 2023

Large Language Models and Transformers (Videos, Simons Institute for the Theory of Computing)

As some of you may know, LightOn has built a few Large Language Models, and we are now making them usable to Enterprise customers. In the meantime and on the theoretical side of things, the Simons Institute for the Theory of Computing has organized a workshop on the topic of Large Language Models and Transformers. The program is listed below; each entry links to the video of the talk (including the talks being streamed this week).

Monday, Aug. 14, 2023

9:15 – 10:15 a.m. Sparks of Artificial General Intelligence, Yin Tat Lee (Microsoft Research)
11 a.m. – 12 p.m. Possible Impossibilities and Impossible Possibilities, Yejin Choi (University of Washington)
1:30 – 2:30 p.m. Towards Reliable Use of Large Language Models: Better Detection, Consistency, and Instruction-Tuning, Christopher D. Manning (Stanford University)
3 – 4 p.m. An observation on Generalization, Ilya Sutskever (OpenAI)
4 – 4:45 p.m. Panel Discussion (moderated by Alexei Efros)

Tuesday, Aug. 15, 2023

9 – 10 a.m. Understanding the Origins and Taxonomy of Neural Scaling Laws, Yasaman Bahri (Google DeepMind)
10 – 11 a.m. Scaling Data-Constrained Language Models, Sasha Rush (Cornell University & Hugging Face)
11:30 a.m. – 12:30 p.m. A Theory for Emergence of Complex Skills in Language Models, Sanjeev Arora (Princeton University)
2 – 3 p.m. Interpretability via Symbolic Distillation, Miles Cranmer (Flatiron Institute)
3:30 – 4:30 p.m. Build an Ecosystem, Not a Monolith, Colin Raffel (University of North Carolina & Hugging Face)
4:30 – 5:30 p.m. How to Use Self-Play for Language Models to Improve at Solving Programming Puzzles, Adam Tauman Kalai (Microsoft)

Wednesday, Aug. 16, 2023

9 – 10 a.m. Large Language Models Meet Copyright Law, Pamela Samuelson (UC Berkeley)
10 – 10:45 a.m. Panel Discussion (moderated by Shafi Goldwasser)
11:15 a.m. – 12:15 p.m. On Localization in Language Models, Yonatan Belinkov (Technion - Israel Institute of Technology)
2 – 3 p.m. Language Models as Statisticians, and as Adapted Organisms, Jacob Steinhardt (UC Berkeley)
3:30 – 4:30 p.m. Are Aligned Language Models “Adversarially Aligned”?, Nicholas Carlini (Google DeepMind)
4:30 – 5:30 p.m. Formalizing Explanations of Neural Network Behaviors, Paul Christiano (Alignment Research Center)

Thursday, Aug. 17, 2023

9 – 10 a.m. Meaning in the age of large language models, Steven Piantadosi (UC Berkeley)
10 – 11 a.m. Word Models to World Models, Josh Tenenbaum (MIT)
11:30 a.m. – 12:30 p.m. Beyond Language: Scaling up Robot Ontogeny, Jitendra Malik (UC Berkeley)
2 – 3 p.m. Are LLMs the Beginning or End of NLP?, Dan Klein (UC Berkeley)
3:30 – 4:30 p.m. Human-AI Interaction in the Age of Large Language Models, Diyi Yang (Stanford University)
4:30 – 5:30 p.m. Watermarking of Large Language Models, Scott Aaronson (UT Austin & OpenAI)

Friday, Aug. 18, 2023

9 – 10 a.m. In-Context Learning: A Case Study of Simple Function Classes, Gregory Valiant (Stanford University)
10 – 11 a.m. Pretraining Task Diversity and the Emergence of Non-Bayesian In-Context Learning for Regression, Surya Ganguli (Stanford University)
11:30 a.m. – 12:30 p.m. A data-centric view on reliable generalization: From ImageNet to LAION-5B, Ludwig Schmidt (University of Washington)
2 – 3:30 p.m. Short Talks
4 – 5 p.m. Short Talks


Friday, December 31, 2021

2021, the year AI ate HPC … and more

Back in 2011, Marc Andreessen announced that Software was eating the world while everyone was trying to make sense of the realities of the cloud versus brick-and-mortar businesses. Eight years later, Tarry Singh articulated how AI was eating software, a year before GPT-3 and Codex would give solid ground to this prediction. Fast forward two years: we just witnessed how AI ate HPC, and we believe those are the first steps towards AI eating Learning, Creative and Office work.

Let me explain.

At LightOn, we have been working on getting AI to be transformative for everyone. For that to happen, we used the Jean Zay French national supercomputer for two different yet somewhat related reasons this past year. First, LightOn’s Optical Processing Unit hardware was integrated into this top105 supercomputer. Even though LightOn’s hardware is analog and uses a technology currently unknown to supercomputing, there are several good reasons the future of computing will use this technology. Relatedly, in a co-design fashion, we also used the Jean Zay facility to implement and run code for building the Large Language/Foundation Models that we believe are key to Transformative AI. In March, we trained the largest French language model ever, called Auriga, and made it available to everyone through our PAGnol demo.

In July, we launched the Muse API, making our language models available for business use. Initially released in private beta, Muse has quickly gained its first customers, and a public commercial version with five languages is to be released in early 2022. Some of these early customers are using this new AI to redefine SEO or the experience for website creation.

“True happiness comes from the joy of deeds well done, the zest of creating things new” Antoine de Saint-Exupéry

Eventually, a major impact of these Large Language Models trained on HPC infrastructures will be the ability for everyone to personally learn faster and for office workers worldwide to get the job done in a fashion never seen before.

If you are a start-up company or an individual starting a business around this promise, don’t hesitate to join the Muse Partnership program, and let’s start a discussion around how Muse can help you.

These models will also have the same effect in creative work and in the discovery process.

Stay tuned, the true AI revolution is really coming!


Tuesday, December 21, 2021

LightOn Photonic coprocessor integrated into European AI Supercomputer


This is history of computing in the making!

Four years ago to the day, LightOn’s first Optical Processing Unit (OPU) had its first light in a Data Center showing that our technology was data center ready.

It is with immense pride and pleasure that we announce that LightOn’s OPU has been installed in one of the world’s Top500 supercomputers as part of a pilot program with GENCI and IDRIS/CNRS.

The team at LightOn is immensely proud to write the future of computing in this world-first integration of a computing photonic device into an HPC infrastructure.

The press release can be found here.

Thank you GENCI and IDRIS/CNRS for making this happen!


Friday, May 21, 2021

The Akronomicon: an Extreme-Scale Leaderboard


As larger models seem to be providing more context and more ability for zero-shot learning, Julien just created the Akronomicon: an Extreme-Scale Leaderboard featuring the world's largest Machine Learning Models. And yes, LightOn is on that board for the moment!

Want to contribute? https://github.com/lightonai/akronomicon


Wednesday, April 28, 2021

Virtual Workshop: Conceptual Understanding of Deep Learning (May 17th 9am-4pm PST)


Just got an email from Rina Panigrahy

Hi Igor,

I am an algorithms researcher at Google (http://theory.stanford.edu/~rinap) and I am organizing this workshop on "Conceptual Understanding of Deep Learning" (details below). It's trying to understand the Brain/Mind as an algorithm from a mathematical/theoretical perspective. I believe that a mathematical/algorithmic approach for understanding the Mind is crucial and very much missing. I'd appreciate any help I can get with advertising this on your blog/mailing-lists/twitter.

Best,
Rina

Here is the invite:

Please join us for a virtual Google workshop on “Conceptual Understanding of Deep Learning”

When: May 17th 9am-4pm PST.
Where: Live over YouTube

Goal: How does the Brain/Mind (perhaps even an artificial one) work at an algorithmic level? While deep learning has produced tremendous technological strides in recent decades, there is an unsettling feeling of a lack of “conceptual” understanding of why it works and to what extent it will work in the current form. The goal of the workshop is to bring together theorists and practitioners to develop an understanding of the right algorithmic view of deep learning, characterizing the class of functions that can be learned, coming up with the right learning architecture that may (provably) learn multiple functions, concepts and remember them over time as humans do, theoretical understanding of language, logic, RL, meta learning and lifelong learning.

The speakers and panelists include Turing award winners Geoffrey Hinton, Leslie Valiant, and Gödel Prize winner Christos Papadimitriou (full-details).

Panel Discussion: There will also be a panel discussion on the fundamental question of “Is there a mathematical model for the Mind?”. We will explore basic questions such as “Is there a provable algorithm that captures the essential capabilities of the mind?”, “How do we remember complex phenomena?”, “How is a knowledge graph created automatically?”, “How do we learn new concepts, function and action hierarchies over time?” and “Why do human decisions seem so interpretable?”

Twitter: #ConceptualDLWorkshop.
Please help advertise on mailing-lists/blog-posts and Retweet.

Hope to see you there!
Rina Panigrahy


Tuesday, April 27, 2021

Randomized Algorithms for Scientific Computing (RASC)


At LightOn, we build photonic hardware that performs random projections, and it is nice to find a comprehensive source of material on the subject in a single document. Here is a report presenting how randomized algorithms are key to the future of computing:

Randomized Algorithms for Scientific Computing (RASC) by Aydin Buluc, Tamara G. Kolda, Stefan M. Wild, Mihai Anitescu, Anthony DeGennaro, John Jakeman, Chandrika Kamath, Ramakrishnan (Ramki) Kannan, Miles E. Lopes, Per-Gunnar Martinsson, Kary Myers, Jelani Nelson, Juan M. Restrepo, C. Seshadhri, Draguna Vrabie, Brendt Wohlberg, Stephen J. Wright, Chao Yang, Peter Zwart

Randomized algorithms have propelled advances in artificial intelligence and represent a foundational research area in advancing AI for Science. Future advancements in DOE Office of Science priority areas such as climate science, astrophysics, fusion, advanced materials, combustion, and quantum computing all require randomized algorithms for surmounting challenges of complexity, robustness, and scalability. This report summarizes the outcomes of the workshop "Randomized Algorithms for Scientific Computing (RASC)," held virtually across four days in December 2020 and January 2021.
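For readers new to random projections, here is a minimal software sketch of the basic idea: multiply the data by a fixed random Gaussian matrix to reduce its dimension while roughly preserving pairwise distances. This is plain NumPy on synthetic data, with illustrative function names and sizes; it is not a model of LightOn's optical hardware and is not taken from the report.

```python
import numpy as np

def random_projection(x, out_dim, seed=0):
    """Project the rows of x to out_dim dimensions with a fixed Gaussian matrix.
    The 1/sqrt(out_dim) scaling keeps squared norms unchanged in expectation."""
    rng = np.random.default_rng(seed)
    r = rng.standard_normal((x.shape[1], out_dim)) / np.sqrt(out_dim)
    return x @ r

def pairwise_dists(a):
    """Euclidean distances between all distinct pairs of rows."""
    diff = a[:, None, :] - a[None, :, :]
    d = np.sqrt((diff ** 2).sum(axis=-1))
    return d[np.triu_indices(len(a), k=1)]

# Synthetic data (an assumption for illustration): 20 points in 2,000 dimensions.
rng = np.random.default_rng(42)
points = rng.standard_normal((20, 2_000))
sketch = random_projection(points, out_dim=500)

# Ratio of projected to original distance for every pair; values near 1
# mean the geometry survived the 4x reduction in dimension.
ratio = pairwise_dists(sketch) / pairwise_dists(points)
print("distance ratio range:", ratio.min(), ratio.max())
```

The projection matrix never adapts to the data, which is why the same fixed random weights can be reused for any input.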


Tuesday, April 06, 2021

The $1,000 GPT-3


Progress usually comes from a steady technology bootstrap…until it doesn’t.

Take for instance the race for the $1,000 genome that started in the early 2000s. Initially, sequencing the human genome meant a race between the well-funded public and private sectors but more importantly, the resources for the first breakthrough ended up costing upwards of $450M. Yet despite all the economic promise of genome sequencing, had Moore’s law been applied, sequencing one full genome would still cost $100,000 today. However, once the goal became clearer to everyone, a diversity of technologies and challengers emerged. This intense competition eventually yielded a growth faster than Moore’s Law. The main takeaway is that one cannot rely on the steady progress of one specific technology alone to commoditize tools.

Figure from NIH “Facts sheets about genomics: The cost of Sequencing a Human Genome”, Dec 7th, 2020.

What does this have to do with the current state of silicon computing and the new demand for Large Language Models (LLMs)? Everything if you ask us and here is how.

Less than a year into existence, Large Language Models like GPT-3 have already spawned a new generation of startups built on the ability of the model to respond to requests for which it was not trained. More importantly for us, hardware manufacturers are positing that one or several customers will be willing to put a billion dollars on the table to train an even larger model in the coming years.

Interestingly, much like the mass industrialization in the 1930s, the good folks at OpenAI are sketching new scaling laws for the industrialization of these larger models.

The sad truth is that extrapolating their findings to the training of a 10-trillion-parameter model involves a supercomputer running continuously for two decades. The minimum capital expenditure of this adventure is estimated to be in the realm of several hundred million dollars.

Much like what happened in sequencing, while silicon improvement and architecture may achieve speedups in the following years, it is fair to say that, even with Moore’s law, no foreseeable technology can reasonably train a fully scaled-up GPT-4 and grab the economic value associated with it.

Rebooting silicon with a different physics, light, and NvNs

For a real breakthrough to occur, much like what happened in the sequencing story, different technologies need to be jointly optimized. In our case, this means performing co-design with new hardware and physics but also going rogue on full programmability.

LightOn’s photonic hardware can produce massively parallel matrix-vector multiplications with an equivalent of 2 trillion parameters “for free”: this is about one-fifth of the number of parameters needed for GPT-4. Next comes revisiting the programmability. LightOn’s current technology keeps these weights fixed by design. Co-design means finding the algorithms for which CPUs and GPUs can perform some of the most intelligent computations while LightOn’s massive Non-von Neumann (NvN) hardware does the heavy lifting. We already published how we are replacing backpropagation, the workhorse of Deep Learning, with an algorithm that unleashes the full potential of our hardware in distributed training. We are also working similarly on an inference step that will take full advantage of the massive number of parameters at our disposal. This effort relies heavily on our access to half a million GPU hours on some of France’s and Europe’s largest supercomputers.

And this is just the beginning. There is a vast untapped potential for repurposing large swaths of optical technologies directed primarily for entertainment and telecommunication into computing.

The road towards a $1,000 GPT-3

Based on the GPT-3 training cost estimates, achieving a $1,000 GPT-3 requires an improvement of four orders of magnitude. Much like what occurred in 2007 with the genome sequencing revolution, Moore’s law may take care of the first two orders of magnitude in the coming decade, but the next two rely on an outburst of new efficient technologies, both hardware and algorithms. It just so happens that GPT-3 has close to 100 layers, so achieving two orders of magnitude savings may arise faster than you can imagine. Stay tuned!
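The arithmetic behind that split can be sketched in a few lines. Two numbers here are assumptions for illustration only, not figures from this post: a training cost of about $10 million, and a Moore's-law doubling period of 18 months.

```python
import math

def orders_of_magnitude(cost_now, cost_target):
    """Number of factors of 10 separating the current cost from the target."""
    return math.log10(cost_now / cost_target)

def moore_gain(years, doubling_period_years):
    """Cost-efficiency multiplier if performance per dollar doubles every period."""
    return 2 ** (years / doubling_period_years)

# ASSUMPTION: illustrative training cost, chosen to match "four orders of magnitude".
assumed_training_cost = 10_000_000  # dollars
target_cost = 1_000                 # dollars

gap = orders_of_magnitude(assumed_training_cost, target_cost)

# ASSUMPTION: doubling every 18 months, sustained for a decade.
gain = moore_gain(years=10, doubling_period_years=1.5)
remaining = gap - math.log10(gain)

print(f"gap to close: {gap:.1f} orders of magnitude")
print(f"Moore's law over 10 years: about {gain:.0f}x")
print(f"left for new hardware and algorithms: {remaining:.1f} orders of magnitude")
```

Under these assumptions a decade of steady doubling yields roughly a 100x gain, which leaves about two orders of magnitude to come from elsewhere.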

Igor Carron is the CEO and co-founder at LightOn
