Portugal built its own language AI, 'Amália,' for just €5.5M

Portugal stacked its own-language data on top of the European open model EuroLLM-9B to ship Amália, its first Portuguese LLM. Weights, data, and code are all open, and the budget was a mere €5.5M — a small nation's language-sovereignty experiment begins.

July 5, 2026 (Sun) · 16 min read

Portugal launches AMALIA, its first open-source Portuguese LLM — Source: Instituto de Telecomunicações

A nation of ten million just built its own language AI

On July 1, 2026, Portugal announced something quiet but surprisingly weighty: Amália, the first large language model (LLM) built specifically for Portuguese, tuned by Portuguese researchers on Portuguese data. The headline isn't a giant model that some US frontier lab burned billions on. It's that a country of barely ten million people built it on a national budget of €5.5 million.

Look at the number alone and it feels almost cute. Next to what OpenAI or Google burn through to train a single model, €5.5M is a rounding error. Yet the announcement got attention across Europe not for the size of the money but for its direction. It's a declaration: "We will not leave our language sitting on an American company's servers."

Even the name is symbolic. Amália is borrowed from Portugal's beloved fado singer Amália Rodrigues. It's also an acronym — "Automatic Multimodal Language Assistant with Artificial Intelligence." Naming an AI after the voice that represents your language's very soul isn't just marketing; it tells you exactly what this project is about. It's about defending the sovereignty of language and culture.

So let's take it apart piece by piece — what actually changes, who built it, who benefits, and how realistic this is as a survival strategy for a small nation in the age of giant AI.

Who built it — the government, a telecom institute, and a root called EuroLLM

The one who set the table was the Portuguese government. The money came from the "Recovery and Resilience Plan" (PRR, Plano de Recuperação e Resiliência), Portugal's national plan for spending its share of the EU's post-COVID recovery fund. Portugal decided to put part of it not into roads or buildings but into "own-language AI infrastructure." The initial investment is €5.5M, with funding secured through the end of 2027. The starting point of this project is the idea that a state can treat language capability as public infrastructure, like roads or electricity.

The actual development was handled by a consortium of universities and research centers. Centered on the Instituto de Telecomunicações (IT) and Instituto Superior Técnico (IST) in Lisbon, it brought in NOVA University Lisbon, the universities of Coimbra, Porto, and Minho, and the Foundation for Science and Technology (FCT). Places like the University of Beira Interior, the University of Évora, and the Lisbon School of Engineering (ISEL) joined as collaborators. Translation-AI startup Unbabel was on the list too. More than 60 researchers and students piled onto this — a genuine national project.

The person who coordinated it is André Martins, an IST professor with Unbabel roots. He put it plainly: Amália "aims to contribute to Europe's strategic autonomy in AI," closing the technology gap between Europe and the US and China. The point is that this isn't framed as an academic side-project — it's described in the language of national strategy.

But the real reason all of this was possible lies elsewhere: a shared European open-source root called EuroLLM. Amália didn't start from a blank slate. It rides on top of EuroLLM-9B, a 9-billion-parameter open model built jointly by Europe. The most expensive, heaviest part of a language model (large-scale pretraining) was already made as a shared European asset, and Portugal only had to handle the "extend and specialize" part — layering its own-language data on top. That's exactly why a €5.5M budget makes sense. Instead of reinventing the wheel, they mounted their own carriage on a wheel Europe built together.

Why this structure matters: it could become the standard recipe for how small nations build their own-language AI going forward. Training a frontier model from scratch is something only a handful of superpowers and mega-corporations can do. But layering your own language onto a shared base is something a mid-sized country can afford. Amália is being presented as the first success story of that recipe.

What actually happened — weights, data, and code, all opened up

Let's start with the core: Amália is fully open-source. The word "open" has gotten so overused it's lost its edge, but Amália is close to the real deal. Model weights, the training dataset, and the source code — all three were released under the Apache 2.0 license. Two versions, amalia-llm/AMALIA-9B-0626-SFT and AMALIA-9B-0626-DPO, are up on Hugging Face, and anyone can inspect, modify, and run them on their own servers. Commercial use isn't blocked either.
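For readers who want to test the "run them on their own servers" claim, here is a minimal sketch in Python using the Hugging Face transformers library. The SFT repository ID is the one quoted above; the DPO ID assumes the same amalia-llm/ prefix, and the dtype and generation settings are illustrative assumptions rather than anything the project documents. Check the model card before relying on any of them.

```python
"""Sketch: load an Amália checkpoint locally with Hugging Face transformers."""

# The SFT ID is quoted in the article. The DPO ID ASSUMES the same
# "amalia-llm/" organisation prefix.
SFT_ID = "amalia-llm/AMALIA-9B-0626-SFT"
DPO_ID = "amalia-llm/AMALIA-9B-0626-DPO"  # assumption: same prefix


def generate(prompt: str, model_id: str = DPO_ID, max_new_tokens: int = 128) -> str:
    """Download (or reuse cached) weights and return a completion.

    Needs `pip install transformers torch` and enough memory for a
    9B-parameter model. Nothing is downloaded until this is called.
    """
    # Imported lazily so the module can be read without the libraries.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    # bfloat16 is an assumed choice to halve memory use versus float32.
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Drop the prompt tokens so only newly generated text is returned.
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

Calling generate("Explica em duas frases o que é o fado.") would then run entirely on local hardware once the weights are cached, which is the property the sovereignty argument depends on.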

Technically, the team took EuroLLM-9B and continued pretraining it on European Portuguese to reinforce its knowledge of the language, extending the context length to 32k tokens. On top of that they added multimodal capability (handling images plus text) and strengthened the safety and evaluation systems. The training data mixed the Portuguese web archive Arquivo.pt, public books, EuroLLM pretraining data, long-context Stack-v2 samples, and synthetic "needle-in-a-haystack" data that exercises long-context retention. Training ran on Barcelona's MareNostrum5 and Minho's DEUCALION supercomputers, using 64 NVIDIA H100 GPUs: the supervised fine-tuning (SFT) stage took 76 hours and 14,000 steps, and the preference-alignment (DPO) stage took 12 hours. By frontier-lab standards that is a small run, but the goal is different. It isn't the world's strongest model; it's "the model that understands Portuguese best."
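The "needle-in-a-haystack" idea mentioned above can be sketched in a few lines of Python: hide one short fact at a chosen depth inside long filler text, then pair it with a question only that fact answers. The filler sentence, the needle, and the function name make_needle_sample are invented for illustration. The article does not describe how the Amália team actually generated its synthetic data, and real pipelines usually measure length in tokens rather than sentences.

```python
import random


def make_needle_sample(filler: str, needle: str, question: str, answer: str,
                       n_sentences: int, depth: float, seed: int = 0) -> dict:
    """Build one long-context sample with `needle` hidden at relative `depth`.

    depth=0.0 puts the needle at the start, depth=1.0 at the end.
    """
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0 and 1")
    rng = random.Random(seed)
    # Vary the filler slightly so the haystack is not one repeated string.
    haystack = [f"{filler} ({rng.randint(1000, 9999)})" for _ in range(n_sentences)]
    position = round(depth * n_sentences)
    haystack.insert(position, needle)
    return {
        "context": " ".join(haystack),
        "question": question,
        "answer": answer,
        "needle_position": position,
    }


# Hypothetical example: 200 filler sentences with the needle in the middle.
sample = make_needle_sample(
    filler="O arquivo guarda páginas antigas da web portuguesa.",
    needle="O código secreto da biblioteca é 4172.",
    question="Qual é o código secreto da biblioteca?",
    answer="4172",
    n_sentences=200,
    depth=0.5,
)
```

Sweeping depth from 0.0 to 1.0 and n_sentences up toward the context limit gives a grid of samples that shows where in a long document a model starts losing track of the hidden fact.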

The intended uses are strikingly practical too. This isn't about spinning up yet another general-purpose chatbot — it's about slotting the model into places Portugal actually needs it. The government named four uses: an AI teaching assistant, a tourism guide for museums and monuments, a digital assistant for citizen services (e-government), and decision support for the Portuguese Navy. That last one — the Navy — stands out, and that's precisely what shows the heart of "sovereign AI." You can't run defense or security judgments through a foreign company's API. So you need a model you control.

Here's the picture, tidied up.

Item	Details
Base model	EuroLLM-9B (shared European open model, 9B params), continued-pretrained in European Portuguese
Training data	Arquivo.pt web archive, public books, EuroLLM pretraining data, Stack-v2 long-context, synthetic data (32k-token context)
Infrastructure	MareNostrum5 & DEUCALION supercomputers, 64 NVIDIA H100 GPUs (SFT 76h / DPO 12h)
Funding	Portugal's Recovery and Resilience Plan (PRR), €5.5M, secured through end of 2027
License	Apache 2.0 — weights, data, and code all public, commercial use allowed
Target uses	Teaching assistant, tourism guide, e-government citizen assistant, naval decision support

The line that really jumps out in that table is the license. Most "open" models toss out just the weights and hide the data or code, but Amália opened all three. The logic: transparency is trust, and trust is what lets you use it in public services.

What each side gains — the slice the government, researchers, and the EU each take

First, what the Portuguese government gains is control. When you put AI into sensitive areas like citizen services, education, or defense, and that model lives in an American company's cloud, both the data and the policy end up in someone else's hands. Raise the price, cut the service, change the terms — you get dragged along. Owning your own model breaks that dependency. Think of €5.5M as the price of buying that independence. Compared to what you'd pay a frontier model's API every year, a home-grown model you build once and use for years could even be cheaper in the long run.

What researchers and universities gain is capability. More than 60 researchers and students actually ran the whole pipeline of extending, aligning, and evaluating a 9B-class model by hand. That experience is a national asset far more valuable than a few papers. The people who'll build the next model, and the one after, are being trained right now. Supercomputer know-how, data curation, safety evaluation — all of it accumulates domestically.

Portuguese-speaking industry and startups win too. Since it's released under Apache 2.0, anyone can build their own products on top of it. If you're building call-center automation, legal-document summarization, or medical-consultation assistants, you can start from a domestic base that genuinely understands European Portuguese, instead of wrestling with an American model's awkward Portuguese. Brazilian and European Portuguese differ in subtle but real ways, and most global models lean toward the Brazilian variant. Amália aims squarely at that gap.

Finally, the EU as a whole gains something. Amália grew on the shared root of EuroLLM, and it gave its improvements and data back out in the open. This is the first proof of a virtuous cycle where multiple European nations build their own-language models while jointly nurturing a shared base. If Spain, Greece, and Poland build their own-language models the same way, the whole EuroLLM ecosystem thickens. Europe's answer to the closed empires of the US and China is "everyone builds their own, but shares the root," and Amália is the model case for it.

This has been tried before — the lights and shadows of national language models

"Our language, by our own hands" isn't a new project. There have been successes and failures, and to see Amália clearly you have to know that history.

On the success side, the one people often cite is the UAE's Falcon. Fueled by sovereign wealth, it shipped open-weight models and once topped open-model leaderboards, leaving the impression that "an oil nation does AI too." The Finland-led European multilingual model Poro and the Arabic-specialized Jais also showed that own-language and regional-language AI can come out genuinely usable. Jais in particular turned into real services in the Arabic-speaking world, proving the commercial viability of region-specialized models.

There's also plenty on the failure side, or the side that fell short of expectations. Quite a few projects made a grand announcement and then quietly vanished, never making it to real deployment or maintenance. The trouble usually hits after the launch. Training a model once is the easier part; the harder and more expensive part is "operation": continuously updating it, managing safety, and wiring it into actual services. Amália's funding is secured only through the end of 2027, so that's exactly the point that will decide things. Shipping a first model and having that model still alive five years later are completely different stories.

Another lesson is the "base choice." A good share of past national models tried pretraining from scratch and collapsed under budget and staffing. Amália riding on EuroLLM is a design that dodged exactly this failure. Leave the heavy part to a shared asset, and spend your own budget only on the parts that truly differentiate (own-language data, alignment, evaluation). That judgment likely puts Amália on the "actually runs" side of the line rather than the "just another announcement" side.

Rivals and comrades — other sovereign AIs and European open models

Widen the frame around Amália and you'll see the world is in the thick of a "sovereign AI" race. It's the trend of each nation or region wanting its own instead of depending on someone else's model, and Amália is one piece of the European edition.

Its closest comrades are obviously EuroLLM and the open-European camp around it. On top of that, an EU-level OpenEuroLLM alliance is pushing a shared model spanning multiple European languages. Amália is, in effect, the "nation-specialized node" of that camp. France's Mistral has a different flavor: it's a commercial startup pitching European frontier-grade models directly against the US labs. So even within Europe there are two tracks. One, like Mistral, is "Europe ships a world-leading commercial model too"; the other, like Amália, is "each country layers its own language onto a shared open base." The two aren't so much rivals as complements.

Go outside Europe and there are players operating at a different scale: the UAE's Falcon, Saudi Arabia's massive Arabic-AI investments, India's many own-language model projects, and China's flood of open models. Most of them spend far more money than Amália. So if you line Amália up against them on parameter counts or benchmark scores, it can look shabby. But that is the wrong axis of comparison. Amália's competitive goal isn't "world's strongest." It's "the model that's best at Portuguese and that Portugal fully controls."

There's an interesting contrast too. Around the same time, data-center investment is also flowing into Portugal — there's word that Nscale is building a data center on the order of €695 million. Put a €5.5M model next to a €695M data center and you get a feel for how many layers this AI-sovereignty race is being fought on. Compute infrastructure (data centers) you buy with money; language capability (models) you make with your own data and research — you need both to be truly independent.

So don't read Amália as "Europe's strongest model." Read it instead as a reference for "how a small nation secures sovereignty on an affordable budget." If it succeeds, that methodology itself becomes a blueprint other mid-sized and small nations can copy-paste.

So what changes — for citizens, developers, small nations, and the open-source community

For Portuguese citizens, there won't be a dramatic change right away. Amália isn't a consumer chatbot; it's a "platform" built so governments and institutions can layer services on top of it. But over time you'll feel it. The government-service bot answering in real Portuguese instead of stilted translationese, the museum guide understanding your own cultural context, the school's assistant AI running in sync with the European-Portuguese curriculum. And along the way, your personal and administrative data being processed domestically instead of on American servers — that's a quiet but big difference.

For developers, this is a gift box. With weights, data, and code all open under Apache 2.0, the starting line has moved for anyone building Portuguese-language products. The fine-tuning ingredients are public, and you can run the whole thing on your own servers (on-premises), so you can experiment without worrying about API costs or data leakage. Especially in heavily regulated medicine, law, and finance, "the data never leaves the building" is a decisive advantage.

For other small nations, Amália is a kind of proof. It showed, as a real case, that "we too can have own-language AI without a frontier lab, on an affordable budget." The recipe is clear too — pick a shared open base (like EuroLLM), curate your own-language and cultural data, tap public supercomputers, and concentrate the budget only on differentiating points. The smaller the population and the more a language falls outside global models' attention — think Greece, Hungary, the Baltic states — the more urgent this recipe is, and Amália handed them a blueprint to copy.

For the open-source community, it's a symbolic win. It showed at a national scale that "fully open (weights + data + code) is realistically possible for a publicly funded project." And the structure of taking from EuroLLM and giving improvements back out in the open is a state-level enactment of the ideal that open source isn't mere "free distribution" but a "cycle of growing a shared asset together." If this runs well, it'll keep cracking the assumption that closed frontier models are the only answer.

Of course there are still plenty of question marks. What about funding after 2027? Will the first model's actual performance be enough to carry commercial services? Are the people and budget there to sustain maintenance and updates? The answers to those questions won't come from the launch materials — they'll come from the next few years of operation. Still, the direction is clear. In an age where only vast capital was supposed to be able to build AI, a small nation actually walked a different path — "a shared root plus its own-language data." That alone makes Amália worth recording.

🥄 Three Things You're Probably Wondering

— Is €5.5M really cheap for building a whole model? By frontier-model standards, absurdly cheap. A GPT-class model costs tens to hundreds of millions of dollars just to train. Amália pulled it off on this budget because they didn't build everything from scratch. They inherited the heavy base — Europe's shared EuroLLM-9B — for free, and only did the "extend and specialize" part of layering own-language data on top. The actual training wrapped up in 76 hours of SFT and 12 hours of DPO on 64 H100s. The real cost, though, isn't training but "what comes next" — maintenance, updates, safety management, wiring into live services. Given that the €5.5M runs through the end of 2027, how they fund that ongoing cost will decide success or failure.

— Does this have anything to do with Brazilian Portuguese? Amália is explicitly a model specialized for European Portuguese. That's the core differentiator. Most Portuguese speakers in the world are in Brazil, so global commercial models mostly lean toward Brazilian vocabulary and grammar. European Portuguese differs subtly in pronunciation, spelling, and phrasing, so the Portuguese those models produce often reads as awkward to a Lisbon native. Amália aimed at exactly that gap. Amusingly, the launch happened at the PROPOR conference held in Brazil: the languages split and specialize, but the academic community stays together.

— Naval decision support? Why is that even in a language model's use list? It looks out of place, but that's precisely the part that reveals the essence of "sovereign AI." You can't run judgment support for defense and security through a foreign company's API. You can't have someone else holding where your data flows, when the model changes, or whether the service gets cut off. If it's a fully open model where you control even the weights, you can drop it onto an air-gapped network and run it strictly domestically. The Navy sitting oddly between soft uses like education and tourism may look strange, but it's actually the sharpest answer to "why must it be a domestic model at all."
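The training figures quoted in the first answer above (64 H100s, 76 hours of SFT, 12 hours of DPO) can be turned into GPU-hours with simple arithmetic. The sketch below uses only numbers from the article and assumes all 64 GPUs were busy for the full wall-clock time of each stage. It covers only the two stages whose durations are quoted; the article gives no duration for the continued-pretraining stage, so that is left out.

```python
# Figures quoted in the article.
GPUS = 64          # NVIDIA H100 GPUs
SFT_HOURS = 76     # supervised fine-tuning, wall-clock hours
DPO_HOURS = 12     # preference alignment, wall-clock hours


def gpu_hours(gpus: int, wall_clock_hours: int) -> int:
    """GPU-hours = GPUs x wall-clock hours (assumes every GPU is busy throughout)."""
    return gpus * wall_clock_hours


sft = gpu_hours(GPUS, SFT_HOURS)   # 64 * 76 = 4864
dpo = gpu_hours(GPUS, DPO_HOURS)   # 64 * 12 = 768
total = sft + dpo                  # 5632

print(f"SFT: {sft} GPU-hours, DPO: {dpo} GPU-hours, total: {total} GPU-hours")
```

Roughly 5,600 GPU-hours for these two stages is small by frontier standards, which fits the answer's point that most of the cost lies in what comes after training.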

Numbers and criteria are as of announcement and may change.

Frequently Asked Questions

Which companies or organizations are mentioned in this article?

The key entities covered in this article include sovereign AI, open source, Portugal, and EuroLLM.

When was this article published?

This article was published on 2026-07-05 by spoonai.

What is the original source of this article?

The original source is AMALIA — Instituto de Telecomunicações (https://www.it.pt/News/NewsPost/5258).

What are the main topics covered in this article?

This article covers: A nation of ten million just built its own language AI, Who built it — the government, a telecom institute, and a root called EuroLLM, What actually happened — weights, data, and code, all opened up, What each side gains — the slice the government, researchers, and the EU each take, This has been tried before — the lights and shadows of national language models.
