Building DGPUNET: Democratizing AI Innovation Through Open Source Infrastructure
October 2, 2025
Over the past several months, I've been working on something that started as a practical necessity but evolved into a philosophical statement about accessibility in AI development. When a startup couldn't get anything better than a pitiful G10 GPU instance from their cloud provider - completely insufficient for the machine learning workloads needed - I realized I had to take matters into my own hands, and my own home...
The Problem: Back to the Bad Old Days
The current AI landscape has created an uncomfortable reality that feels disturbingly familiar to those of us who lived through the mainframe era of the 1970s. We're witnessing what I can only describe as a return to the "bad old days" - where meaningful AI research and development increasingly requires making offerings to the people in white lab coats, begging for meager smatterings of resources to do our daily work. The G10 was such a blow because the startup was just bootstrapped: the cloud provider wouldn't trust them with access to real GPU instances until they had first "proven" themselves longitudinally. The startup was happy to pay for GPU resources, but initially the provider wouldn't allow even a single instance with GPU access. It took FIVE appeals before they granted one tiny instance of this very low-end GPU - far below what many consumers now have in their laptops...
Cloud computing costs for serious machine learning work can run thousands of dollars monthly, effectively creating a barrier that excludes many potential innovators from participating in this technological revolution. But it's more than just cost - it's the fundamental shift back to centralized control that troubles me most. As I've written before about computing eras, we're experiencing a regression from the Third and Fourth Ages of computing (PC Revolution and Internet Revolution) back toward the controlled, gatekept access models that characterized the mainframe era.
The hoarding and expense of GPUs has only accelerated this trend. We've moved from the most "free and libertarian" computing era in history back to a model where businesses and individuals willingly, though mostly unwittingly, surrender their computational autonomy to centralized providers. My experience with cloud provider resource limitations exemplifies this perfectly - waiting months for permission to access more than 1,000 spot instances, experiencing what can only be described as "mainframe bureaucracy" in cloud form. This was at a well-established multinational software development company I worked for at the time, with over 2,000 developers and a 20+ year history with cloud providers. I was personally granted a $250,000/month R&D budget, and yet I couldn't even start the scaling tests for weeks or months due to the cloud provider's resistance to freeing up resources.
This hits particularly close to home given my work on SIIMPAF (Synthetic Intelligence Interactive Matrix Personal Adaptive Familiar) and AICLPH (AI Large Context Project Helper). These projects consistently maxed out individual systems, including my 24GB RTX 3090, 16GB 4090, and 32GB 5090, when running the full feature stack. The choice seemed binary: either shell out enormous sums for cloud infrastructure or accept limitations that would cripple my development of these open source projects.
The Solution: Distributed GPU Network (DGPUNET)
Rather than accept these limitations and continue the cycle of technological dependency, I decided to see if I could cobble together what I nicknamed DGPUNET - a distributed GPU network - using consumer hardware and open source technologies. This approach represents more than just a technical solution; it's a deliberate rejection of the current trend toward centralized cloud and AI infrastructure, at least for R&D. It may not be viable in full production settings, but after doing the math, it could potentially save hundreds of thousands, if not millions, of dollars in monthly or annual R&D costs compared to the cloud.
As I've discussed regarding microservices and technological bandwagoneering, the key is asking the right questions first: What specific problems are we trying to solve? What are our measures of success? In this case, the problems were clear - insufficient computational resources at reasonable cost, and freedom from vendor lock-in and resource gatekeeping.
DGPUNET Prototype Hardware List
The network currently consists of (except for the Mac, all are running Q4OS Debian Linux):
- Custom-built PC Tower i7 with 32GB RAM and Nvidia RTX 3090 (24GB VRAM)
- Alienware M16 laptop i7 with 64GB RAM and Nvidia RTX 4080 (8GB VRAM)
- Alienware M18 laptop i9 with 64GB RAM and Nvidia RTX 4090 (12GB VRAM)
- Alienware M18R2 laptop i9 with 64GB RAM and Nvidia RTX 4090 (16GB VRAM)
- MacBook Pro M4 Pro with 64GB RAM
- Soon adding: Dell XPS Tower i7 with 128GB RAM and Nvidia RTX 5090 (32GB VRAM)
All connected through a 10 Gbps switch using Ray clustering - an open source framework for distributed computing that handles the orchestration across different hardware configurations automatically.
Technical Implementation
The Ray framework proved ideal for this heterogeneous setup. While each machine's network interface handles between 1-2.5 Gbps (creating some bottlenecks for parameter sharing), Ray is turning out to minimize data movement rather efficiently and is (so far) doing a decent job managing workload distribution across the cluster. It's helping bring my AICLPH & SIIMPAF projects closer to being fully realized, and helping me solve the short-to-mid-term R&D needs of the startups until they can get real funding for a full production environment. Even then, I'll likely go with an on-prem-multi-cloud-hybrid setup, with the primary core on-prem, scaling into the cloud only as needed (a winning formula I've successfully implemented with a number of startups, SMBs, and enterprise companies over the decades).
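For anyone curious what the wiring looks like in practice, here's a minimal sketch of how a heterogeneous Ray cluster like this can be brought up and inspected. The addresses, ports, and layout below are placeholders for illustration, not my actual DGPUNET configuration.

```python
# Minimal sketch of joining and inspecting a heterogeneous Ray cluster.
# Hypothetical addresses/values - not the actual DGPUNET configuration.
#
# On the head node (e.g., the tower), from a shell:
#   ray start --head --port=6379
# On each worker (the laptops, etc.):
#   ray start --address='<head-ip>:6379'

import ray

# Connect this Python process to the already-running cluster.
ray.init(address="auto")

# Show what the cluster thinks it has: total CPUs, GPUs, and memory
# aggregated across every node that has joined.
print(ray.cluster_resources())

# Per-node breakdown - useful for confirming each laptop's GPU registered.
for node in ray.nodes():
    if node["Alive"]:
        print(node["NodeManagerAddress"], node["Resources"])
```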
Total cluster resources:
- 416GB system RAM
- 92GB total VRAM across GPUs
- Distributed compute capacity that can handle 100B+ parameter models, with room to reach 200B+ using hybrid GPU/CPU inference!
The beauty of this approach lies in Ray's ability to handle mixed workloads. Some tasks run on high-VRAM nodes for GPU-accelerated inference, while others leverage CPU resources for preprocessing, postprocessing, and caching.
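As a rough illustration of that mixed-workload idea (a sketch only, with made-up function bodies - not SIIMPAF or AICLPH code), Ray lets you declare per-task resource needs and then places the work on whichever node can satisfy them:

```python
import ray

ray.init(address="auto")

# CPU-bound preprocessing: Ray will schedule this on any node with
# spare CPU cores - a laptop's CPU or the Mac, for example.
@ray.remote(num_cpus=2)
def preprocess(doc: str) -> str:
    return doc.strip().lower()  # placeholder for real cleanup/chunking

# GPU-bound inference: Ray will only place this on a node with a free GPU,
# keeping the high-VRAM cards busy with what they do best.
@ray.remote(num_gpus=1)
def generate(prompt: str) -> str:
    # placeholder for a real model call (vLLM, llama.cpp, ollama, etc.)
    return f"response to: {prompt}"

docs = ["  Some RAW text  ", "  Another Document  "]
cleaned = ray.get([preprocess.remote(d) for d in docs])
replies = ray.get([generate.remote(c) for c in cleaned])
print(replies)
```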
Here are some example distributions:
GPU-Only Inference (92GB VRAM):
- 70-80B parameter models comfortably
- Could potentially handle 100B+ parameter models depending on precision (FP16 vs FP8)
Hybrid GPU+CPU Inference (416GB total):
- 200B+ parameter models using techniques like:
  - Offloading layers between GPU and CPU
  - Model parallelism across the distributed setup
  - Quantization (4-bit, 8-bit) to fit larger models
A Few Practical Model Examples:
- Llama 2/3 70B: Easily fits into VRAM
- GPT-4 class models (~175B-220B): Possible with hybrid approach
- Mixture of Experts models: Could handle very large MoE architectures
Some key factors:
- Ray's distributed inference can spread model layers across the cluster
- Quantization techniques can effectively double capacity (see the sketch after this list)
- The 10Gbps network may limit real-time inference speed for the largest models, but batch processing may still make those workloads work well
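To make the quantization and GPU/CPU offloading point concrete, here's one way it can look with Hugging Face Transformers and bitsandbytes - a sketch under the assumption that those libraries (plus accelerate) are installed, with a placeholder model name; other stacks (llama.cpp, vLLM, ollama) have their own equivalents:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-3.1-70B-Instruct"  # placeholder; any large causal LM

# 4-bit quantization roughly quarters the VRAM footprint versus FP16,
# which is how 70B-class models become feasible on consumer cards.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# device_map="auto" lets layers that don't fit in VRAM spill over to
# system RAM - the hybrid GPU+CPU idea described above.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

inputs = tokenizer("Why run models on your own hardware?", return_tensors="pt")
outputs = model.generate(**inputs.to(model.device), max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```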
Theoretically, I'm getting close to the territory where I can run almost any of the currently available open-source models.
Many of them would cost thousands per month on cloud providers!
For the cost of 1-3 months of cloud-hosted AI, I can instead buy the hardware a little at a time (say, one card or system per month), re-use each for just the cost of electricity after that, and build up the power of the cluster with each addition.
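As a back-of-the-envelope way to think about it (all numbers below are hypothetical placeholders, not my actual costs), the break-even point is just the hardware outlay divided by what the cloud would have charged each month, minus electricity:

```python
# Hypothetical, illustrative numbers only - plug in your own quotes.
monthly_cloud_cost = 6000.0    # comparable multi-GPU cloud capacity, per month
used_hardware_cost = 12000.0   # one-time spend on used/refurb GPUs and systems
monthly_electricity = 200.0    # rough power cost of running the cluster

# Each month of self-hosting saves the cloud bill minus the power bill.
monthly_savings = monthly_cloud_cost - monthly_electricity

# Months until the hardware has paid for itself.
break_even_months = used_hardware_cost / monthly_savings
print(f"Break-even after ~{break_even_months:.1f} months")  # ~2.1 with these numbers
```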
Real-World Applications
This infrastructure directly supports my ongoing research projects:
SIIMPAF Development: The distributed setup allows SIIMPAF to run its full feature stack - document processing, vector search, AI model integration, voice-to-text (OpenAI Whisper), text-to-speech (Coqui TTS), dynamic personality systems, and emotional state tracking - without overwhelming any single system.
AICLPH Operations: The expanded context and session management capabilities that AICLPH provides work much more effectively when backed by distributed resources that can handle large knowledge archives and maintain persistent context across extended sessions.
RPG Research Applications: My work analyzing data from over 100,000 tabletop, live-action, electronic, and hybrid role-playing game participants requires substantial computational resources for pattern analysis and therapeutic gaming research.
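As a simplified illustration of how one piece of the SIIMPAF stack can ride on the cluster (a sketch only, not SIIMPAF's actual code, and assuming the open source openai-whisper package is installed), speech-to-text can be wrapped as a Ray actor so transcription doesn't compete with the main inference GPUs:

```python
import ray
import whisper  # the open source openai-whisper package

ray.init(address="auto")

# A long-lived actor holds the loaded model so it isn't reloaded per request.
# num_gpus=0 keeps it on CPU capacity, leaving the big cards free for LLM work.
@ray.remote(num_cpus=4, num_gpus=0)
class Transcriber:
    def __init__(self, size: str = "base"):
        self.model = whisper.load_model(size)

    def transcribe(self, audio_path: str) -> str:
        return self.model.transcribe(audio_path)["text"]

transcriber = Transcriber.remote()
text = ray.get(transcriber.transcribe.remote("meeting_clip.wav"))  # placeholder file
print(text)
```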
Cost Analysis Built on Experience
The financial advantages are significant. This entire distributed setup, assembled from used and refurbished equipment, costs a fraction of equivalent cloud infrastructure. While I won't publish exact figures, the monthly operational cost (primarily electricity) is less than what a week of equivalent AWS or Azure compute would cost.
To be clear, I do have some infrastructure experience - my hardware and software background goes back to 1979. As CIO at PC Easy in the mid-90s, I hand-built several thousand custom PCs myself, then, as we rapidly grew, I trained and supervised 12 technicians each building 20+ custom systems per day.
Later, as Director of Operations and Corporate Systems & Components Architect for Franklin Covey in the late 1990s and early 2000s, supporting operations across 150+ countries, I really grew my global enterprise infrastructure experience. I have designed and built by hand (or supervised teams doing so) entire server farms, colocation facilities, and ISPs from scratch. During that time I was also sent to audit and identify the best colocation and SOC facilities in the USA and Canada.
All of which helped increase our freedom of options when weighing the TCO of cloud versus on-prem: equipment, software, support, staffing, insurance, and so on.
There is a lot to consider; it isn't just about the hardware when planning larger-scale environments.
To be clear, I am not saying the cloud doesn't have its place, but want to provide a warning to folks not to blindly rely on always using a sledgehammer for picture-frame-nailing.
More important than the costs, though - and perhaps most important in the Data Age - this infrastructure (and data) fully belongs to us.
There are no usage limits, no surprise billing, no service interruptions (if built right), no vendor lock-in, and no harvesting of our own intellectual property (IP) to then turn around and sell it back to us for a fee.
Lessons Learned
Building DGPUNET reinforced several principles I've held throughout my 45+ years in technology:
- Open source tools remain the great equalizer: Ray, along with ollama, vllm, Jan.ai, and other open source AI tools, makes sophisticated AI infrastructure accessible to independent developers and small organizations.
- Consumer hardware can compete: While enterprise hardware has advantages, thoughtfully assembled consumer equipment can provide impressive capabilities at dramatically lower costs - though it is getting increasingly difficult to find people who actually know how to work with hardware.
- Innovation thrives with constraints: Like the true "hacker gurus" of the mainframe days - the ones who took 1,000 lines of my buggy code and turned it into 100 lines of high-quality code - if you don't give up and are determined enough to "find a way", limited resources can trigger innovative, creative solutions. In this case, that determination ultimately produced a more flexible and cost-effective system than simply throwing money at cloud services.
Contributing Back
As always, I/we're committed to giving back to the open source community that makes this work possible. Bug reports, documentation improvements, and code contributions flow back to the Ray project and other tools we leverage. This isn't just about taking from the commons - it's about strengthening the ecosystem that enables this kind of innovation.
Looking Forward: Breaking the Cycle
I'm exploring adding my MacBook Pro M4 Pro (64GB unified memory) to the cluster as a CPU-only worker. While the Mac can't participate in CUDA-based GPU inference, Ray's heterogeneous design means it could potentially handle CPU preprocessing, caching, and lighter inference tasks.
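If the Mac does join, the idea would look something like the following hedged sketch: register it as a GPU-less worker, optionally with a custom resource tag (the "apple_silicon" label here is a hypothetical name I'm using for illustration) so lighter tasks can be steered to it.

```python
# On the MacBook, join the cluster as a CPU-only worker (shell, hypothetical values):
#   ray start --address='<head-ip>:6379' --num-gpus=0 --resources='{"apple_silicon": 1}'

import ray

ray.init(address="auto")

# Steer lightweight work to the Mac via the custom "apple_silicon" tag,
# while anything requesting num_gpus stays on the CUDA machines.
@ray.remote(num_cpus=4, resources={"apple_silicon": 0.25})
def cache_embeddings(chunks: list[str]) -> int:
    # placeholder for preprocessing / caching / light CPU inference
    return len(chunks)

print(ray.get(cache_embeddings.remote(["chunk one", "chunk two"])))
```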
The next hardware upgrade under consideration is swapping the RTX 3090 for an RTX 5090 (32GB VRAM), moving the 3090 into an older custom tower I have, which would push our capabilities into the 100B+ to 200B+ parameter model range for certain workloads.
The technical specifications are secondary to the broader implications. As I've written about quantum computing and technological transitions, we're approaching another major shift in computational paradigms. Quantum computing will initially be accessible only to large organizations, potentially throwing us "further back to a situation more like the mainframe days prior to the PCs of the 80s."
This makes projects like DGPUNET even more critical. By establishing distributed, self-owned infrastructure now, we're positioning ourselves to maintain computational independence through the next technological transition. When quantum computing does become accessible to smaller organizations, having proven distributed infrastructure and the operational knowledge to manage it will be invaluable.
The Bigger Picture: Technological Liberation vs. Dependency
This project represents more than just technical problem-solving. It's a demonstration that meaningful AI research and development doesn't require massive corporate resources or venture capital funding. With creativity, persistence, and the right open source tools, individual researchers and small teams can build infrastructure that competes with much larger, better-funded operations.

In 2021, I was able to build a real-time ASR NLP system for closed captions in Jitsi videoconferencing rooms as part of a larger learning management system (LMS) I was working on. Initially each room required 2 server instances, with each instance needing a minimum of 8GB RAM and 4 vCPUs. That was not affordably scalable when we needed to support 20,000+ concurrent users across several thousand meeting rooms. So, with a helpful intern under my guidance, we worked diligently over the following months (old-school "hacking" to make the code better) until it was even faster and more accurate with only 1 vCPU and 0.5GB RAM per server instance (with some clever clustering offloading) - using, customizing, and giving back to open source the whole way. The initial mainstream, bloated version was much slower and less accurate; the finalized version ended up 150% faster and 30% more accurate than Google's offering (at the time)!
The democratization of AI infrastructure through open source tools and consumer hardware helps ensure that innovation isn't confined to a small number of well-funded organizations. That diversity of perspectives and approaches is essential for developing AI systems that better serve broader human needs rather than only narrow commercial interests.
We're at a critical juncture in computing history. Those who cannot remember the past are condemned to repeat it (though the Foucault-pendulum of history more often rhymes or hazily mirrors than directly repeats), and we're seeing clear parallels between today's cloud-centric AI development and the centralized computing models of the 1970s. The difference is that this time we have the knowledge and tools to choose a different path - but only if people actually know there are other (and perhaps, in many cases, better) ways.
DGPUNET represents a small but concrete step toward maintaining the computational freedom that previous generations of technologists fought to establish. It's a reminder that the most transformative computing innovations have historically come from individuals and small teams with limited resources (literal startups in a garage) but unlimited creativity - not from centralized institutions that hoard computational power and, all too often, inevitably ossify into bureaucratic gatekeepers.
For those considering similar projects: the technical barriers are lower than they might appear, the open source ecosystem is remarkably robust (though recently increasingly threatened!), and the cost advantages over cloud solutions become more pronounced as compute requirements grow. The learning curve is steep but manageable, and the independence it provides is well worth the effort.
As I continue refining DGPUNET, expanding both SIIMPAF and AICLPH, and preparing them to be open sourced, I'll share more technical details and lessons learned. The goal isn't just to solve our immediate infrastructure needs, but to demonstrate practical paths for others facing similar resource constraints.
The future of AI innovation shouldn't be limited to those with the deepest pockets. Projects like DGPUNET prove there are alternatives.
Hawke Robinson has been working with technology since 1979. He is Chief Information & Technology Officer of Practicing Musician SPC, Founder of BciRpg.com, NeuroRpg.com, and the non-profit 501(c)(3) RpgResearch.com, and, as a Washington State Department of Health Registered Recreational Therapist with a background in neuroscience, cognitive neuropsychology, and research psychology, is recognized as "The Grandfather of Therapeutic (Role-Playing) Gaming." His current research focuses on AI applications in education, therapeutic gaming, and large-scale data analysis at various companies.
DGPUNET Addendum: Building on Decades of Clustering Experience
Some personal historical context
Early Linux and Clustering Foundations (1994-2000s)
My journey with distributed computing began long before DGPUNET. Since adopting Linux in 1994, I've been building custom systems and exploring ways to harness multiple machines for computational tasks that single systems couldn't handle - from the Distributed.net and SETI@home projects I participated in, to full server clustering and on-prem open source cloud infrastructure.
Beowulf Clusters and Open Source Distributed Computing
In the late 1990s and early 2000s, I worked extensively with Beowulf clustering technology (among others) - one of the pioneering approaches to building high-performance computing clusters using commodity hardware and open source software. Beowulf clusters represented exactly the kind of democratized computing approach that DGPUNET embodies today: taking standard PC components and Linux systems, then networking them together to create supercomputer-class performance at a fraction of traditional costs. I was able to take these early R&D experiments and implement them in full production settings over the years, helping to achieve 99.999% annual uptimes at a fraction of the cost.
The fundamental principles I learned during those Beowulf implementations directly inform DGPUNET's design:
- Commodity hardware can deliver enterprise-class performance when properly orchestrated
- Open source clustering software provides flexibility that proprietary solutions sometimes can't match
- Network topology and communication patterns are critical for distributed workload performance
- Heterogeneous systems can be effectively managed within a single cluster framework
From Expensive Solaris to Cost-Effective Linux
In addition to IBM and other midrange and mainframe systems, I also worked with many flavors of true UNIX (AIX, HP-UX, and others). My experience with Sun Solaris E-series systems provided valuable lessons in both the power and the limitations of enterprise clustering solutions. While these systems offered robust clustering capabilities that excelled for internal enterprise infrastructure, their cost and vendor lock-in made them impractical for many applications, and they often failed to perform well in public Internet-facing, high-demand-variability settings.
The transition to Linux-based clustering (as Sun, IBM, and many others eventually also came to realize years later) represented more than just a cost reduction - it was a fundamental shift toward technological independence. Building custom PCs for projects like the University of Utah brain mapping initiatives, I learned to create massive RAID arrays and distributed processing and storage systems that could handle the computational demands of scientific research.
The MightyWords Transformation: $3M Solaris Farm Whupped by a $100K Linux Cluster
One of the most dramatic demonstrations of this approach came during my time as CTO at MightyWords, where we faced a critical infrastructure decision. When MightyWords was still a project under Fatbrain.com (the number 3 online bookseller at the time, behind only Amazon.com and bn.com), the "top tier" San Francisco consulting firm Scient had been hired to write the $5M+ code for MightyWords 1.0. I came into the picture right after the CEO signed the contract with them, so I had to work with them until 1.0 was finished, as MightyWords was now funded and could become its own company. The existing system, designed by the now-defunct Scient, depended on a $3 million Solaris cluster of 40+ servers that was both expensive to maintain and poorly performing - in large part due to Scient's bad design and horrific coding, but also due to some fundamental issues with this type of infrastructure, which wasn't ideal for Internet-facing environments.
Scient's approach exemplified everything wrong with late-1990s dot-com boom enterprise consulting: massive teams (we were assigned 50+ developers - an entire floor of their skyscraper in downtown San Francisco), expensive pair-programming methodologies badly implemented, millions in consulting fees, and ultimately hideous code that failed to deliver on its promises (I later had to throw the entire code base in the dumpster). Their solution was over-engineered, vendor-locked, and fundamentally unsuited to our needs. Version 1.0 initially had a 70% support rate - though 30-50% of that was due to the entire industry demanding we use the terrible Adobe PDF DRM of the time!
The streamlined technology department I assembled consisted of 30 staff handling our entire development and infrastructure operations, supplemented by 20 offshore QA personnel.
Within just a few months, we designed and deployed a replacement Linux cluster built primarily on open source software with minimal custom coding. The hardware foundation was a cluster of VA Linux 1U servers with a total cost under $100,000 - remarkably, I still have several of these VA Linux systems running in my home lab 23 years later, hosting community websites.
The performance results were dramatic: our lean solution delivered 200x better performance (that's not a typo - two hundred times) compared to the original Scient infrastructure. When you consider the total cost comparison - Scient's code development ($5M+) plus Adobe DRM plus the Solaris server farm ($3M+) totaling over $8M - the contrast becomes even starker.
This transformation represented more than simple cost reduction; it embodied a fundamental shift in architectural philosophy. Rather than attempting to scale up expensive proprietary systems, we scaled out using commodity hardware and open source software, proving that smart engineering could dramatically outperform expensive enterprise solutions.
Retaining What Worked: Oracle on Solaris E450s
Importantly, this transformation wasn't about abandoning all enterprise technologies indiscriminately. We retained several Sun E450 systems running Oracle databases because, at the time, MySQL wasn't mature enough for proper transaction handling, and PostgreSQL hadn't yet achieved the scalability we required.
This selective approach - keeping enterprise solutions where they provided genuine value while replacing overpriced, underperforming components with open source alternatives - became a template I've followed throughout my career.
Multi-Platform Expertise: Understanding All Technologies
My approach to these architectural decisions has always been informed by deep, hands-on experience across multiple platforms and technologies. I acquired certifications as a Microsoft Certified Systems Engineer (MCSE, 1998), Certified NetWare Administrator (CNA, 1998), Sun Certified Solaris Administrator (2001), SANS Institute Certified Incident Handler (2002), and later became a Novell Certified Linux Instructor (before they went under), and Dell Certified Systems Engineer (2006). This multi-platform background means my advocacy for open source solutions comes from a position of understanding the genuine strengths and weaknesses of all these technologies, not from ideological bias.
The decision to replace the Solaris cluster with Linux wasn't made from ignorance of enterprise Unix capabilities - quite the opposite. Having worked extensively with Solaris systems across multiple continents, and understanding their architectural strengths, I could make informed decisions about when those strengths justified the costs and when open source alternatives could deliver better value.
Similarly, my Microsoft certification and extensive experience provided insight into Windows-based enterprise solutions, helping me understand when mixed-platform approaches made sense versus pure Linux implementations. The Novell & Solaris certifications and experience rounded out my understanding of enterprise networking and directory services.
This broad certification background reinforced a key principle: the best technology choice depends on specific requirements, not vendor preferences or religious adherence to particular platforms. Each technology has its place, but that place should be determined by technical merit and cost-effectiveness, not by vendor marketing or industry fashion.
Learning from Database Evolution
The database component of that transformation also illustrates how technological landscapes evolve. Today's MySQL and PostgreSQL implementations are far more capable than their late-1990s versions, just as today's consumer GPUs are more powerful than yesterday's enterprise solutions.
This pattern repeats throughout computing history: technologies that start in expensive, specialized systems eventually become commoditized and accessible. DGPUNET represents the current iteration of this cycle in AI/ML infrastructure.
Open Source Clustering Technologies: Then and Now
The clustering technologies I worked with in the late 1990s and early 2000s laid important groundwork:
Beowulf Clustering:
- PVM (Parallel Virtual Machine) - early distributed computing framework from 1989
- (*)MPI (Message Passing Interface) - standardized parallel programming interface (still widely used today)
- Custom Linux distributions optimized for cluster nodes - distributions like ROCKS, Warewulf, etc.
- Network-based boot systems for cluster management - PXE boot, diskless nodes, etc.
Modern Equivalents:
- Ray for distributed ML workloads
- Kubernetes for container orchestration - modern equivalent for cluster management
- Apache Spark for big data processing - for distributed data processing
- Slurm for workload management - widely used HPC scheduler
(*)Minor note: MPI is actually still very much in use today in HPC environments, so it bridges both eras.
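For the curious, here's a tiny then-and-now comparison - a hedged sketch, assuming mpi4py and Ray are installed (and in practice these would be two separate scripts, with the MPI half launched under mpirun) - showing the same "scatter the work, gather the results" idea in both generations of tooling:

```python
# Two tiny sketches of the same "scatter work, gather results" pattern.
# (Shown together only for comparison; normally these are separate scripts.)

# --- Then: MPI via mpi4py (launch with e.g. `mpirun -n 4 python mpi_sum.py`) ---
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
chunks = [list(range(i, 100, size)) for i in range(size)] if rank == 0 else None
local = comm.scatter(chunks, root=0)      # each rank receives one chunk
partial = sum(x * x for x in local)       # local computation
totals = comm.gather(partial, root=0)     # root collects the partial sums
if rank == 0:
    print("MPI total:", sum(totals))

# --- Now: the same shape of computation with Ray ---
import ray

ray.init(address="auto")

@ray.remote
def partial_sum(chunk):
    return sum(x * x for x in chunk)

ray_chunks = [list(range(i, 100, 4)) for i in range(4)]
print("Ray total:", sum(ray.get([partial_sum.remote(c) for c in ray_chunks])))
```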
The fundamental concepts remain the same - distribute workloads across multiple machines, manage resource allocation efficiently, and provide fault tolerance. What's changed is the sophistication of the software and the power of the underlying hardware (sadly and arguably, while the hardware has advanced and software innovations like blockchain and AI have incrementally evolved, much of the core programming language and software development work of the past 15 years is in many ways a severe regression!). But that is a whole other conversation for another time.