GPL compliance and permissive training data theory

This is the second post within a new series that I might start one day, about how companies abuse common misunderstandings of the GNU General Public License (GPL) to sell their stuff. Today, a slightly scary example. Scary, as it is so far off the mark.

The company Exafunction, Inc. claims that its product “Codeium” provides intelligent programming assistance based on a large language model (LLM): just like Copilot from GitHub, Inc., but even better, as it supposedly does not infringe any license, and specifically not the GPL. Their writeup “GitHub Copilot Emits GPL. Codeium Does Not.” offers an adventurous interpretation of the GPL: you need consent to use GPL-licensed code in a commercial context. Moreover, training your model on purely permissive-licensed code will supposedly free you of any legal trouble.

Things are slightly different. Strange that nobody told them in their “… early conversations with the open source community”.

The GPL does not restrict commercial use. It does not even refer to it at all. You are fine in any field of endeavour as long as you respect and fulfil its obligations.

The main problem with generative AI and the current ML-based programming assistants is that you cannot trace verbatim copies of code back to their origin. Because of that, you cannot fulfil the most essential obligation of any Free and Open Source Software license: attribution, that is, calling out the original authors.

It does not help if you train your model with just permissive-licensed code. You will infringe the underlying licensing terms if you do not provide any reference to the original authors and license(s), no matter if it is a permissive or a copyleft license. Either way, you will not have a valid legal basis, i.e. a license, to re-use the original work, and it is as bad as any copyright violation, with all of its consequences.

For more details, or before starting the marketing campaign for your new programming assistant, it could be worth taking a closer look, for example at the ongoing GitHub Copilot litigation and its underlying motivation.

So long, Twitter

It must have been sometime in 2009 that I created my first account with Twitter, Inc. I was not really sure what to do with it. How can you have a serious conversation or share useful information when you are limited to just 140 characters? Somehow this service still felt relevant.

Over the years, it became increasingly valuable. It enabled me to follow amazing people, to read enriching discussions, and to benefit a lot from the shared knowledge, opinions, and all the useful pointers to stunning resources on the Internet, not only personally but also on a professional level. At some point, I decided to close all my other social media accounts – namely orkut, StudiVZ, Facebook, LinkedIn, Xing – as, compared to Twitter, their output was close to zero and I wanted to focus on what had become helpful to me.

Yet, my own contribution to the dialogue remained small. I tweeted about new additions to my online presences, took part in a few conversations, and mostly retweeted things that I either considered worthy of more attention or simply appreciated being pointed to.

All the time, I did not feel that comfortable with the fact that I was the actual product, as the service remained free of charge. Is a “Like” already way too personal, so they’d be able to precisely profile me?

I highly value everything that the Twitter staff has done to make the platform the way it came to be … was.

Now it is time to move on. The Fediverse is in every dimension much closer to what I was originally looking for. And I have found my way to get there.

So long, Twitter – Inc. and its community … thanks for the amazing ride, you definitely made history!

Apple Support Experience

Until recently, I would have recommended the company, and especially the hardware from Cupertino, almost without reservation. It is a golden cage, but it does what it is supposed to do, quite reliably and for a long time. And if you did need help at some point, you got it, and the quality was OK.

I had owned my Apple Watch Series 5 for three years: the stylish and sturdy stainless steel model, at a retail price of just under 800 EUR. I often went swimming with it, so far only in fresh or chlorinated water. On vacation it went into salt water for the first time, of course only after checking its suitability. Seawater is no problem and, as is well known, no jumping, diving, or other activities that could create too much pressure. On the very day of its contact with the sea, the smartwatch quit working with a glaring white and finally red glow. Completely. No reset possible. Charging made the watch overheat, without any reaction whatsoever. Even after days.

Luckily, there is Apple Support. A quick look at the homepage turned up, as the only option outside the statutory warranty obligation, a flat “fee for out-of-warranty service” of 430.90 EUR. Clearly too much. Sports watches with very similar functionality and a guaranteed water resistance, no ifs or buts, of up to 50 meters cost considerably less when new. Used units of the same Apple Watch model can be had in the classifieds for around 200 EUR.


HowTo: Migrating your microblogging from Twitter to Mastodon

A few days ago we got yet another reason to leave centralized, a-social networks behind. You probably do not want crazy billionaires, pardon, serial innovators to reinnovate your virtual neighborhood without mercy. Aside from that, there are the secret timeline algorithms, about which little is known other than that they primarily amplify hatred and biased, extremist opinions for the sake of maximizing impressions, user interactions, and platform revenue. At the same time, one’s own personal timeline is being polluted and tampered with by all kinds of paid advertisement, whilst the underlying personal data of each and every user is generously sold in all directions.

But wasn’t this all supposed to be just about “… connecting with friends and the world around you”?

Exactly. But for that we have the Fediverse and when it comes to microblogging there is Mastodon. As you probably have already heard about those, let’s directly dive into organizing your Twitter exodus step by step.


GitHub Copilot – Your AI-powered accomplice to steal code?

Last week GitHub and its parent company Microsoft announced “GitHub Copilot – their/your new AI pair programmer”. The New Stack, The Verge, and CNBC, among others, have reported extensively about it. And there is a lot of buzz around this new service, especially within the Open Source and Free Software world, not only among its developers but also among its supporting lawyers and legal experts. The actual news is not that groundbreaking, though, because Copilot is not the first of its kind. Similar ML-/AI-based offers like Tabnine, Kite, CodeGuru, and IntelliCode, which have also been trained with public code, are already out there.

Copilot is currently in “technical preview” and, according to GitHub, is planned to be offered as a commercial version.

Illustration: GitHub Inc. © 2021

The core of it appears to be OpenAI Codex, a descendant of the famous GPT-3 for natural language processing. According to its homepage it “[…] has been trained on a selection of English language and source code from publicly available sources, including code in public repositories on GitHub”. Update 2021/07/08: GitHub Support appears to have confirmed that all public code at GitHub was used as training data.

GitHub is the platform where the majority of the source code of the global Open Source community has meanwhile accumulated: 65+ million developers, 200+ million repositories (as of 2021) or 23+ million owners of 128+ million public repositories (as of 2020). Alternatives to it have become scarce, unless you want to host your code on your own.

Great, what amazing times we are living in! Sounds like with Copilot you no longer need your human co-programmers, who assisted you during the good old times in the form of pair programming or code review. Lucky you, and especially your employer. On top of that, you will save precious time because it will help you to directly fix a bug, write typical functions or even “[…] learn how to use a new framework without spending most of your time spelunking through the docs or searching the web”. Not to forget about copying & pasting useful code fragments from Stack Overflow or other publicly available sources like GitHub.

At the same time, two essential questions arise, in case you care a bit about authorship:

  1. Did the training of the AI infringe any copyright of the original authors who actually wrote the code that was used as training data?
  2. Will you violate any copyright by including Copilot’s code suggestions in your source code?

Let’s not talk about another aspect that GitHub mentions in their FAQs – personal data: “[…] In some cases, the model will suggest what appears to be personal data – email addresses, phone numbers, access keys, etc. […]”


The impact of Open Source within the European Union

The results of the Open Source Impact Study commissioned by the European Commission have been widely discussed, mainly because of its numbers. Although only announced just now, the study identifies, for the year 2018, a contribution by FOSS of 0.4% to the GDP, worth EUR 63 billion, if measured by the increase in commits. 10% more contributors would even raise the GDP of the European Union by 0.6% (EUR 95 billion). The overall cost-benefit ratio is estimated at no less than 1:4.
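
As a rough plausibility check, and taking the quoted figures at face value, the following sketch derives the total EU GDP that each pair of percentage and absolute value implies. The function name is hypothetical and not taken from the study; only the four input numbers come from the text above.

```python
def implied_gdp_billion(share_percent: float, value_billion_eur: float) -> float:
    """Total GDP (in billion EUR) implied by a percentage share and its absolute value."""
    return value_billion_eur / (share_percent / 100.0)


# Figures as quoted in the post: 0.4% = EUR 63 billion, 0.6% = EUR 95 billion.
baseline = implied_gdp_billion(0.4, 63)
scenario = implied_gdp_billion(0.6, 95)

# Both pairs should point to roughly the same total GDP if they are consistent.
print(round(baseline), round(scenario))  # 15750 15833
```

Both pairs imply a total of roughly EUR 15.8 trillion, so the two quoted figures are at least consistent with each other.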

But it gets even more interesting, when looking into the results of the accompanying survey covering about 900 stakeholders (mainly companies) from all around Europe.

For them, incentives for using and investing in Open Source have been, sorted by relevance:

  1. finding technical solutions
  2. avoiding vendor lock-in
  3. carrying forward the state of the art of technology
  4. knowledge creation

As benefits they have seen:

  • support of open standards and interoperability
  • access to source code
  • independence from proprietary providers of software

Among the participants, the cost-benefit ratio has even been estimated at 1:10.

Quite a few news outlets have reported about the presentation of the study’s findings at the OpenForum Europe Policy Summit 2021, though the final report to the Commission is still pending.

English: “How much are open-source developers really worth? Hundreds of billions of dollars, say economists” by Daphne Leprince-Ringuet
German: “Studie: Open Source trägt 95 Milliarden Euro zur EU-Wirtschaftskraft bei” by Stefan Krempl

Update 2021/02/15 – Netzpolitik.org today also published an interview with the innovation researcher Knut Blind, who played a major role in the study: “Open Source braucht öffentliche Finanzierung” by Alexander Fanta

Update 2021/09/06 – The full report has now been published: “Study about the impact of open source software and hardware on technological independence, competitiveness and innovation in the EU economy”.

Virtual Conference Experiences

The current circumstances also forced conferences (those gatherings with really large audiences) completely into cyberspace. Some stuck with the traditional approach of streaming talks via off-the-shelf videoconferencing applications and relied on the very limited interaction features built into these poor proprietary tools. Others have gone completely new ways and brought fascinating and well-working concepts for still successfully connecting the crowds, enabling lively conversations, and facilitating the exchange of knowledge and experiences in a remote environment.

Let’s start with rc3 and its virtual conference venue in the form of rc3 world, implemented with Work Adventure. In a 2D pixel adventure style, you could walk around the area, and as soon as you approached other characters, a live audio and video stream with the humans or other life forms controlling those characters would open. Limited to 4-5 persons at a time, it allowed you to talk directly with each other – face to face. Thanks to the limited number of participants, you were still able to have a working conversation.

You somehow needed to get used to having unexpected and sudden interactions with one another – on live video – but still it brought back the sorely missed opportunity to get in personal touch with other participants who possibly share similar interests.

rc3 world (screenshot by derstandard.at)

FOSDEM 2021, the world’s biggest conference on Free and Open Source Software, usually taking place in Bruxelles, had for me a very convincing overall concept. The organizers and infrastructure artists did a tremendous job that allowed for the most impressive conference experience so far and for a long time. Naturally, it was based purely on Free Software, with Matrix, Element, and Jitsi at its core.

How did it work and what was so great about it?

Presentations on specific areas of interest were grouped in virtual rooms with a fixed agenda, like at most physical conferences. Participants logged into a chat infrastructure which represented the rooms as group conversations. You would simply join the room(s) that you were interested in and could start texting with each other and the speakers, like on IRC. Talks had been recorded beforehand and were automatically started – by the computer (systemd) – at their scheduled time. Their audio and video were streamed right above your chat window. When the talk ended, the Q&As were streamed live for a fixed amount of time within that room until the next talk started auto-playing according to schedule. During that first part of the Q&A session of a talk, moderators were clarifying upvoted questions and comments from the chat and interacting in real time with the presenters. Those interested could then continue discussing with the speakers and further extend their conversation by switching to a separate room. So per talk you had a dedicated room for the second part of the Q&A that would open shortly after and even allowed anyone there to interact live via audio and video.
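
The timing scheme described above can be sketched roughly as follows. This is only an illustration and not FOSDEM’s actual tooling: the `Slot` class, the `build_timeline` function, the room names, and all example times are hypothetical. It also assumes that the follow-up room opens exactly when the main-room Q&A window ends.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Slot:
    title: str
    start: datetime    # scheduled start of the pre-recorded talk
    talk_minutes: int  # length of the recording
    qa_minutes: int    # fixed live Q&A window in the main room


def build_timeline(slots):
    """Return (time, room, event) tuples for all slots, ordered by time."""
    events = []
    for s in slots:
        talk_end = s.start + timedelta(minutes=s.talk_minutes)
        qa_end = talk_end + timedelta(minutes=s.qa_minutes)
        events.append((s.start, "main", f"auto-play recording: {s.title}"))
        events.append((talk_end, "main", f"live Q&A: {s.title}"))
        # Assumption: the follow-up room opens when the main-room Q&A ends.
        events.append((qa_end, f"followup/{s.title}", "open room with audio/video"))
    # sorted() is stable, so events at the same time keep their insertion order.
    return sorted(events, key=lambda e: e[0])


# Made-up example schedule with two back-to-back talks.
slots = [
    Slot("Talk A", datetime(2021, 2, 6, 10, 0), talk_minutes=20, qa_minutes=10),
    Slot("Talk B", datetime(2021, 2, 6, 10, 30), talk_minutes=20, qa_minutes=10),
]
timeline = build_timeline(slots)
for when, room, event in timeline:
    print(when.strftime("%H:%M"), room, event)
```

The point of the design is visible in the timeline: the next recording starts in the main room at exactly the moment the previous talk’s discussion moves to its own room, so neither interrupts the other.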

In sum, that meant that you could check the schedule for topics you were interested in, connect at the announced time, and be sure to really listen to that talk instead of watching tech staff doing mic checks or heavily delayed earlier talks, whilst being unsure whether and when the one you came for would actually start.

In addition, the highly valued Q&A and the subsequent backstage (and off-the-record) conversations could still take place without interrupting or being interrupted by the next talk.

Just impressive and so useful! Thanks a lot to all who made this happen and work that well! These concepts are now here to stay, even once conferences hopefully resume in the physical world soon.

2021/02/15 – Updated link of [matrix] to point at the now available summary of their efforts for FOSDEM 2021.