A few years back, I was introduced to the world of task and job queues. A friend was writing a tutorial on RabbitMQ, and I’d discovered Gearman as a potential solution for a large task I was working to complete. The idea of leveraging multiple servers to complete a single task in parallel was a foreign concept to me.
Once I’d mastered the idea of concurrent processing, I never looked back.
Learning Curve
The biggest downside of picking up a new server technology is that you often have to learn it on an actual server. Keep in mind that I was learning Gearman while still using WAMP for local development; the idea of running a local VM hadn’t occurred to me, so I learned the API by cowboy-coding against a Digital Ocean droplet.
Version control made my lessons (and mistakes) very public, so I worked to learn quickly.
Learning quickly meant I missed a step or two along the way in my education. I haven’t worked with Gearman directly in a few years, so I took some time to brush up on the PHP interface to see what I could remember.
The Job Server
Gearman is split into three components: the job server, the workers that process jobs, and the clients that submit them. The server acts as a central “hub” that accepts job information from various clients and makes it available to workers when they come online. There isn’t much to the application itself; it’s a daemon that runs persistently on the server and holds job data (both incoming and outgoing) in memory until it’s needed.
Gearman can also work with a persistent database backend to keep job information around in the event of a reboot. If your application is mission-critical (e.g. billing) or long-running (e.g. sending out an email newsletter), maintaining a persistent backend is a good idea.
Installation is straightforward, thanks to package managers for all of the major Linux server distributions. Some developers have even built Dockerized distributions, if that’s your thing. The point is: you need a server somewhere, listening on port 4730, that both workers and clients can connect to.
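Once the daemon is up, a quick sanity check from PHP is to ping it through the client. This is a minimal sketch; the hostname is a placeholder, and ping() assumes a reasonably recent pecl/gearman extension:

// ping.php: confirm the hub is reachable
$client = new GearmanClient();
$client->addServer( 'gearman.example.com', 4730 ); // placeholder address
echo $client->ping( 'hello' ) ? "Gearman responded\n" : "No response from hub\n";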
Workers
Workers are, usually, simple PHP scripts that can be daemonized on the server. Your approach to daemonization is up to you; in the past, I’ve used Supervisord to keep a few copies of the script running, or Dockerized the process with a restart policy.
The script itself merely registers a “function” with the Gearman server. Like any other piece of PHP code, this function takes arguments and returns data; it just receives its arguments from a dynamic process (Gearman) and returns its data to the same.
Assume for now that you need to process PDF documents in bulk on the server, whether for optical character recognition, machine learning, or some other purpose. There are a lot of documents to import, so you want to parallelize the operation over multiple worker processes (and potentially multiple servers as well). Your code might take the form of a PDFImporter object with a single import() method:
// worker.php
$importer = new PDFImporter();

$worker = new GearmanWorker();
$worker->addServer( 'localhost', 4730 );

// Expose PDFImporter::import() to the hub under the name "import".
$worker->addFunction( 'import', array( $importer, 'import' ) );

// Block forever, processing jobs as they arrive.
while( $worker->work() );
For the sake of this example, the worker script will run on the same server that’s running Gearman itself, hence the binding to localhost above. In a real-world example, the local reference would be replaced with the IP address of the central server (or multiple servers in a distributed case).
This worker will run forever and will wait until a task is available on the local Gearman hub to process. It exposes a single function to the Gearman server, though it could expose many. Likewise, there could be multiple workers exposing different functions all to the same hub.
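For example, the same worker script could expose a second function with one extra registration call. In this sketch, the ocr() method is purely hypothetical:

// worker.php (variant exposing two functions)
$importer = new PDFImporter();
$worker = new GearmanWorker();
$worker->addServer( 'localhost', 4730 );
$worker->addFunction( 'import', array( $importer, 'import' ) );
$worker->addFunction( 'ocr', array( $importer, 'ocr' ) ); // hypothetical second method
while( $worker->work() );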
The beauty of Gearman is that you can use different languages to build the workers. This example uses PHP alone, but there are libraries for Java, Python, Ruby, Go, and many other languages as well. In theory, you could have two workers written in different languages expose the same function to Gearman.
The Client
The client is just as simple. It’s an operation within PHP (or your language of choice) that connects to Gearman and publishes a job to be processed. Often, you’ll want to run a job in the background, meaning the main application doesn’t wait for the task to complete before moving on. This allows a single application to create multiple jobs at once, then exit back to a browser or command line for further instruction.
// app.php
$client = new GearmanClient();
$client->addServer( 'localhost', 4730 );

// Queue the job and return immediately; $file and $email are assumed
// to have been set earlier. The workload is a single delimited string.
$job = $client->doBackground( 'import', $file . '||:||' . $email );

// Persist the returned handle so we can check the job's status later.
$db->record( $job );
Gearman automatically returns a job handle that can be used to query for the job’s status later on, if required. The example above fits with our PDF import illustration from earlier and instructs a Gearman server to execute a function named “import,” passing it a concatenated string with a filename and an email address for later notification.
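If you’ve stored that handle, checking on the job later takes a single client call. A minimal sketch, assuming $job still holds the handle returned above; jobStatus() reports whether the job is known to the server, whether it’s running, and a numerator/denominator progress pair:

// status.php
$client = new GearmanClient();
$client->addServer( 'localhost', 4730 );

$status = $client->jobStatus( $job );
if ( $status[0] && $status[1] ) {
    echo "Still running: {$status[2]}/{$status[3]}\n";
}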
Keep in mind that Gearman workloads (the second argument in the invocation) are only strings. In this example we’re passing a filename, but the workload could just as easily be the base64-encoded binary body of the file itself. Gearman supports messages up to 4GB in size – the only limit here is the bandwidth the client has available.
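On the worker side, the registered callback receives a GearmanJob object and must unpack that string itself. Here’s a minimal sketch of what the PDFImporter class from the worker example might look like; the actual processing is an assumption left as a comment:

// PDFImporter.php
class PDFImporter {
    public function import( GearmanJob $job ) {
        // Recover the filename and email address from the delimited workload.
        list( $file, $email ) = explode( '||:||', $job->workload() );

        // ... OCR / machine-learning processing would happen here ...

        // The return value is sent back to Gearman as the job's result.
        return "Imported: $file";
    }
}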
Security
When I first set up a Gearman server, I didn’t fully understand the ramifications of leaving the server in the public space. I was somewhat new to server maintenance, so while I could follow a tutorial to install a new application, I didn’t really understand what was going on. This meant the server I’d built was completely exposed to the Internet – anyone could connect to my Gearman server whether they were running my code or not.
Think about that for a moment.
Anyone who knew I was running Gearman and the IP address I was using could leverage my server for their own purposes. Among other things, they could:
- Enumerate the functions I had registered on the server
- Invoke a function I had registered (even with a malicious or inaccurate payload)
- Register their own functions on the server
Given Gearman’s flexibility, a third party could even register a new worker to manage an already-registered function. Gearman would happily forward along any invocations, failing to distinguish between legitimate workers coded by me and malicious workers coded by a third party.
With the PDF processing server illustrated in the examples above, this might not be too much of an issue. Our machine learning database would be missing a few documents or might have some junk data inserted by a third party. If Gearman were instead being used to parallelize stock trades or process credit transactions or notify medical study participants of information …
The impact of an improperly-secured Gearman server grows with the importance of the application using it.
In the Wild
I use a tool called Shodan on occasion to check that my servers (and Raspberry Pi clusters) are properly locked down while still visible on the public Internet. Shodan is an amazing tool, but it’s also a treasure trove of information for potentially malicious parties. In addition to scanning specific IP addresses, it maintains a directory of machines listening on specific ports.
My last check through Shodan showed over 8,000 individual servers listening on port 4730, the default port for Gearman. I didn’t drill down much further than that,[ref]I did not connect to any of these machines. I merely enumerated the cached listings stored in Shodan’s public database. Accessing someone else’s computer system without authorization is illegal.[/ref] but Shodan ran a simple status query when it indexed these machines; the public database lists both the machines and their registered functions.
Remember, if your machine is visible to tools like Shodan, it’s visible to malicious third parties who can manipulate or steal your data.
Locking Things Down
Thankfully, the situation doesn’t have to be dire. Gearman itself doesn’t have any authentication, but the server running it can protect the daemon from abuse. Simply configure your server’s firewall to block connections on port 4730 from everything except localhost and any specific, whitelisted servers. Shodan won’t be able to index you and third parties won’t be able to abuse your installation.
Gearman is a powerful application that you can and should be using to parallelize the processing done by your PHP server. It’s also something that can easily expose you and your team to undue stress if not properly configured and secured. Take some time today to lock your Gearman servers down.
Then take some more time to ensure your other hosted applications are properly protected as well.