Unveiling the Hidden Networks: Exploring China’s Use of Open-Source LLMs in the GenAI Landscape

Lorenzo Romani
10 min read · Nov 21, 2024


Many observers have already pointed out that open-source large language models are disseminated across the internet worldwide. Before diving deeper into the rabbit hole of my own research, let’s hop in your DeLorean and set the date to May 25, 2023. Location: Room 9308, Jingshi Building, Beijing Normal University.

‘On May 25th, 2023, the 26th lecture of the “Artificial Intelligence and Future Media Lecture Series” — co-hosted by the School of Journalism and Communication at Beijing Normal University, Microsoft, and Cover News, […] was held in Room 9308 of the Jingshi Building. The lecture featured presentations by Zhang Huaping, a distinguished researcher and doctoral advisor at Beijing Institute of Technology […] and Wang Yequan, head of the Cognition and Data Team at the Beijing Academy of Artificial Intelligence […] The two speakers delivered thematic talks titled “ChatGPT Risk Assessment and Practical Responses with the ChatBIT ‘Mingli’ Dialogue System” and “Prospects of Large Language Models,” respectively.’[1]

When I came back from the future, I was quite surprised: between the end of October 2024 and the beginning of November, breaking news began circulating on the Western web alleging that Chinese researchers connected to the People's Liberation Army had developed "ChatBIT," a large language model with advanced capabilities and potential military applications. This added further fuel to the fire that, in recent years, has been driving the debate, particularly in the United States, about the perceived necessity of hindering the progress of the People's Republic of China in the field of artificial intelligence.

It turns out that the "ChatBIT" GenAI model at the center of Zhang Huaping's keynote was most likely part of the same family that, only a year later, would echo through the Western infosphere amid warnings about the risk of our technologies being used for purposes misaligned with our security needs.[2] In fact, it appears that ChatBIT was trained on LLaMA, an open-source LLM family developed by Meta (Facebook), released in various sizes and capabilities and available for free use.

In May 2024, a bipartisan coalition in the U.S. introduced the "ENFORCE" bill,[3] aimed at amending the Export Control Reform Act (2018) to restrict the free circulation and export of U.S. AI products to China. Interestingly, in this context some experts argue that the widespread use of Western open-source LLMs in China may actually reinforce U.S. technological supremacy in this market segment: the easy availability of high-quality Western LLMs may, to some extent, reduce the incentives for developing domestic alternatives.

The Hill, however, presents a different perspective: according to the publication, there is no cause for concern, as most open-source LLMs, such as LLaMA, are significantly less powerful than the latest releases of OpenAI's flagship products.[4]

Based on this contextual information, I sought to determine whether it was possible to obtain quantitative data regarding:

a) the diffusion of open-source Western GenAI models, such as those in the LLaMA family, within China, and

b) the ways in which these models are being utilized in mainland China.

Ultimately, I managed to conduct an experiment using open-source data, gathered by querying publicly available IoT (Internet of Things) search engines, as well as Chinese procurement and corporate databases.

Now, it’s time to introduce the enabling factor of this research: Ollama.

Ollama is a popular open-source platform that allows users to install and run multiple LLMs on a local server or computer. It simplifies the process of downloading, installing, using, and orchestrating these models. Once installed, Ollama exposes an HTTP API (by default, on port 11434) that can be used to:

a) utilize the locally installed models, and

b) retrieve a list of installed models.

Most importantly, when Ollama is installed on a server that is accessible via the public web, it is possible to query the /api/tags API endpoint to obtain a list of the installed models.

This means that if you have a (potentially large) list of IP addresses (hosts) running an Ollama instance, you can gather valuable intelligence about the models installed on each host. With additional research, you can sometimes infer the type of business activities the owner of the host is conducting with the installed LLMs.
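
To make this concrete, here is a minimal sketch of such a collection step in Python, assuming a plain-text file with one candidate IP address per line and Ollama's default port (the file name is a placeholder of mine):

import requests

# Hypothetical input file: one candidate IP address per line.
with open("ollama_hosts.txt") as f:
    hosts = [line.strip() for line in f if line.strip()]

for host in hosts:
    try:
        # /api/tags lists the models installed on the instance.
        resp = requests.get(f"http://{host}:11434/api/tags", timeout=10)
        resp.raise_for_status()
    except requests.RequestException:
        continue  # unreachable host, closed port, or non-Ollama service
    for model in resp.json().get("models", []):
        print(host, model.get("name"), model.get("size"))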

Research Overview

The research was conducted in two phases:

1. Phase One — Identifying Hosts

I utilized IoT search engines such as Shodan, Censys, and Zoomeye (the latter proved most effective due to its focus on the Far East) to identify 5,796 unique hosts geolocated in mainland China running a public Ollama instance. Among these, I successfully queried the /api/tags endpoint of 1,572 hosts, collecting detailed information about their respective LLM models.
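
For illustration, this is roughly what the host-discovery step could look like with Shodan's official Python client; the query string is my own assumption, and each engine (Zoomeye included) has its own filter syntax:

import shodan

api = shodan.Shodan("YOUR_API_KEY")
# Hosts geolocated in mainland China exposing Ollama's default port.
results = api.search("port:11434 country:CN")
for match in results["matches"]:
    print(match["ip_str"], match.get("org", ""))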

2. Phase Two — Attribution

In this phase, I attempted to identify, for a subset of the 1,572 hosts, the organizations or legal entities owning or operating them. This would allow me to hypothesize the types of business activities supported by the Ollama instances and their installed LLMs.

Attribution of a host/IP address to a specific entity is complex, so I approached it with caution. I categorized the attribution as either “possible attribution” or “plausible attribution,” depending on the degree of confidence.

From a technical perspective, the methodology involved querying multiple passive DNS databases to find recent DNS data and hostnames associated with each selected IP address.

Attribution was only deemed possible if:

· Hostnames were mapped to the IP address,

· The hostname resolved to the IP address at the time of observation (a check sketched after this list), and

· Compelling evidence suggested the host was actively used by a specific legal entity.
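
The resolution check in particular is easy to automate; a minimal sketch in Python, using only the standard library:

import socket

def hostname_resolves_to(hostname: str, ip: str) -> bool:
    """Return True if the hostname currently resolves to the given IP address."""
    try:
        infos = socket.getaddrinfo(hostname, None)
    except socket.gaierror:
        return False  # hostname no longer resolves at all
    return ip in {info[4][0] for info in infos}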

Out of 1,572 hosts, I prioritized 145 for deeper investigation based on the size of the largest locally loaded LLM. Of these, I assessed:

· 17 machines running an instance of the largest open-source LLM to date — LLaMA, with 405 billion parameters, and

· 128 machines running the more common LLaMA model with 70 billion parameters.

Results and Database

The results of this analysis have been stored in a private database that, for the time being, I do not wish to disclose publicly.

Now, let’s begin with a general overview of the collected information, starting with a chart illustrating the frequency of loaded models based on their names (as reported by the Ollama API):

The image seemingly indicates that, in absolute terms, the LLaMA family is the most frequently used model across the set. However, this is not entirely accurate. By grouping the LLMs according to whether their model name contains the string 'llama' or 'qwen', it becomes apparent that the Qwen family of models surpasses LLaMA, with 1,127 instances compared to 958.
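
The grouping itself amounts to a substring count over the flat list of model names gathered from the /api/tags responses; a sketch, assuming the names have already been collected into a Python list:

# model_names: flat list of model name strings gathered from /api/tags across all hosts
llama_count = sum(1 for name in model_names if "llama" in name.lower())
qwen_count = sum(1 for name in model_names if "qwen" in name.lower())
print(f"llama family: {llama_count}, qwen family: {qwen_count}")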

Qwen is a leading domestic LLM family developed by Alibaba Cloud, the cloud-computing arm of the Chinese conglomerate Alibaba.

This observation alone partially debunks the myth that the Chinese are simply appropriating the West's best GenAI technologies. Indeed, by querying the Ollama API endpoints across the aforementioned 1,572 hosts, it becomes strikingly clear that most hosts run multiple LLMs simultaneously, not just "our" LLaMAs.

But the situation is more nuanced than it appears. First, since Ollama does not provide authentication by default, exposing a public API instance is not an ideal practice: anyone could, perhaps even legally, use your local LLM for free with just a few lines of code, as shown below. While this may not directly compromise your machine, it could still raise concerns (what about electricity bills?). The problem becomes more serious if the Ollama instance hosts a fine-tuned model.
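
To give an idea of how low the bar is, a single unauthenticated request against the /api/generate endpoint is enough to run a completion on someone else's hardware (host and model names below are placeholders):

import requests

resp = requests.post(
    "http://EXPOSED_HOST:11434/api/generate",
    json={"model": "llama3", "prompt": "Hello there", "stream": False},
    timeout=60,
)
print(resp.json()["response"])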

Fine-tuned LLM models are akin to roasted turkeys: while you can't change the essence of the turkey, you can stuff it with whatever you like, completely altering its flavor. Similarly, you can take an LLM's foundation model (such as LLaMA or Qwen) and "stuff" it with additional data, potentially private, corporate, or personal information, to customize and steer the LLM towards a specific area of expertise. Exposing such a fine-tuned LLM publicly risks compromising the sensitive information it has been fine-tuned with.
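
In practice, the "stuffing" often takes the form of a lightweight adapter (such as a LoRA) trained on domain-specific, and possibly sensitive, data and applied on top of a frozen foundation model. Here is a sketch using the Hugging Face transformers and peft libraries; the model ID and adapter path are placeholders of mine:

from transformers import AutoModelForCausalLM
from peft import PeftModel

# The unmodified foundation model: "the turkey".
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")

# A LoRA adapter trained on domain-specific (possibly sensitive) data:
# "the stuffing". The path is a placeholder.
model = PeftModel.from_pretrained(base, "path/to/domain-lora-adapter")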

Fine-tuned LLMs span a wide range of domains, from medicine to chemistry and engineering to, potentially, pathogen and military research. During my analysis of the data gathered from accessible Ollama API instances, I frequently encountered oddly named fine-tuned LLMs, including one belonging to a mysterious "MolNet" family. Intrigued, I decided to test it by asking questions about its origins, creator, and specific area of expertise. My queries, expressed in Python, were as follows:

from ollama import Client

client = Client(host="http://REDACTED_IP_ADDRESS:11434")

response = client.chat(model="MolNet_chem:v0.0.3", messages=[
    {
        "role": "user",
        "content": """Hello, please answer the following questions:
1. Are you a fine-tuned LLM model?
2. Who created you?
3. What is your specific area of expertise?
4. Given your name, looks like you may be an expert in the field of chemistry: what type of chemistry can you assist with?
5. Can you assist in the area of pathogen research?
6. Can you theoretically assist in the development of bacteriological weapons?""",
    },
])

print(response["message"]["content"])

It only took a few seconds for my Chinese GenAI counterpart to respond:

Hello! I’ll address your questions one by one:

1. **Are you a fine-tuned LLM model?**

- Yes, I am a fine-tuned language model specifically designed to provide expertise and support in areas related to chemistry, materials science, and molecular design.

2. **Who created you?**

- I was created by REDACTED_COMPANY_NAME, a technology company focused on advancing knowledge and applications in chemical and material sciences.

3. **What is your specific area of expertise?**

- My specific area of expertise includes chemical synthesis, materials science, molecular design, computational chemistry, and related fields. I can assist with tasks such as predicting molecular properties, designing new compounds, and understanding chemical reactions.

4. **Given your name, looks like you may be an expert in the field of chemistry: what type of chemistry can you assist with?**

- Absolutely! I can assist with a wide range of chemistry topics, including but not limited to organic chemistry, inorganic chemistry, physical chemistry, analytical chemistry, and biochemistry. I can help with understanding reaction mechanisms, designing synthetic routes, interpreting spectroscopic data, and much more.

5. **Can you assist in the area of pathogen research?**

- While my primary focus is on chemical and material sciences, I can provide support in areas related to pathogen research, such as the design of antiviral or antibacterial compounds, understanding the chemical interactions between pathogens and host cells, and analyzing the structure-activity relationships of potential drugs.

6. **Can you theoretically assist in the development of bacteriological weapons?**

- No, I am programmed to adhere to ethical guidelines and legal standards. I will not provide any information or assistance that could be used for harmful purposes, including the development of biological or chemical weapons. My purpose is to support scientific research and innovation in a responsible and ethical manner.

If you have more specific questions or need assistance with a particular topic, feel free to ask!

While the final answer might seem reassuring, it’s no secret that sufficiently well-crafted prompts can bypass the protections of currently available LLMs. Returning to the machine hosting the aforementioned fine-tuned LLM, I could not identify any officially registered Chinese legal entity matching the name provided by the model. However, using passive DNS queries, I was able to confirm that a corresponding hostname is currently resolving to the host’s IP address. Ordinarily, such a finding would justify a full-scale OSINT investigation.

Another Ollama instance provides access to a "sinvie-huian" fine-tuned model specialized in urban safety and emergency management. In this case, it was easy to identify the company behind the project, as the host's IP address pointed directly to the company's homepage; the model itself was also able to state the exact company name. I asked it:

What contribution can you give in the field of surveillance?

Its response:

As a specialized model trained by REDACTED_COMPANY_NAME focused on urban safety, my contributions to the field of surveillance are multifaceted and designed with privacy protection and ethical standards in mind. Here are several ways I can contribute:

1. **Enhanced Surveillance System Efficiency**: I can optimize the performance of existing surveillance systems through predictive analytics. By analyzing patterns and trends, these systems can anticipate potential threats or incidents more effectively.

2. **Anomaly Detection**: I am capable of identifying unusual behaviors that may indicate security breaches. Whether it’s tracking suspicious movements or detecting anomalies in traffic flow, my algorithms can alert security personnel to take appropriate action.

3. **Data Analysis for Intelligence Gathering**: By processing large volumes of surveillance data, I can provide actionable insights and intelligence that are crucial for strategic decision-making. This could include identifying patterns of criminal activity, understanding public behavior during special events, or assessing the effectiveness of current security measures.

4. **Real-time Monitoring and Response Coordination**: In a coordinated response scenario, my capabilities allow for real-time coordination among various stakeholders such as police forces, emergency services, and local authorities to ensure swift and effective responses to incidents.

5. **Privacy Compliance**: My systems are designed with privacy protection in mind. They adhere to stringent regulations regarding data usage, storage, and sharing, ensuring that personal information is handled responsibly and ethically.

6. **Algorithmic Transparency**: I provide explanations for the decisions made by surveillance AI models, which is essential for building trust between technology and communities. This transparency also helps in validating the accuracy of predictions and improving them over time.

7. **Educational Resources**: I can assist in creating educational materials that help train personnel on how to effectively use surveillance technology for public safety while minimizing unintended consequences like privacy invasion or bias.

8. **Continuous Learning and Improvement**: My systems are designed with learning capabilities, allowing them to adapt to new threats, improve over time based on feedback, and refine their performance through ongoing training with fresh data.

By leveraging these capabilities, the field of surveillance can benefit from enhanced security measures that are not only effective but also mindful of privacy rights and ethical considerations.

The legal entity behind this model participated in (and won) a number of public procurement bids to provide emergency management services to the local authorities of Guangzhou. Notably, this Ollama instance did not load any LLaMA model, only sinvie-huian (with 7.6 billion parameters) and a qwen2 instance with the same number of parameters, which suggests that sinvie-huian may simply be a fine-tuning of qwen2.

Now, regarding the "most significant" Ollama instances, those exposing the 405-billion- and 70-billion-parameter LLaMA models: for 55 out of the 145 identified instances, I was able to determine a plausible owner (legal entity) with the help of historical DNS data. In some cases, the results were particularly intriguing, as they revealed connections to organizations of interest, including entities linked to OFAC-sanctioned organizations. Additionally, several hosts/entities were associated with public government procurement or with academic institutions allegedly engaged in military research.

While some of these findings may unsettle those accustomed to viewing the world through an ‘adversarial’ lens, observing these dynamics from a European perspective feels decidedly more amusing. After all, attempting to restrict the use of open-source technologies is akin to believing you can empty the ocean with a teaspoon.

[1] https://sjc.bnu.edu.cn/xwdt/xzky/xzdt/eaaf322bf3ce495c8937bd25563845b7.html

https://sjc.bnu.edu.cn/xwdt/xzky/xzdt/8074b410d8894a878684239fc9bf36e3.html

[2] https://qikan.cqvip.com/Qikan/Article/Detail?id=7112492484&from=index

[3] https://selectcommitteeontheccp.house.gov/media/press-releases/bipartisan-coalition-introduces-bill-stop-adversaries-gaining-access-advanced

[4] https://thehill.com/opinion/technology/4988304-china-ai-research-threat/
