Archive for the ‘erlang’ Category

riak_redis_backend : thoughts

Post Mortem

riak_redis_backend has migrated from a gist to a full size github project.

Finally, my code passes all the tests I throw at it. It’s been a fun ride up to now. I still have a few ideas to implement but the bulk of the work is done, and it’s not even 100 LOC …

I recently spent some quality time with redis and the erldis erlang library for a project.

And last week I decided that Riak would be my “week-end sized tech”. In the wee hours of Monday morning, I decided that writing a Redis backend for Riak should be fun and easy.

It was easy — in hindsight. But I hit a few roadblocks.

The riak_backend_tests passed quite quickly but it was not enough. My own riak test code tended to be slow and mapreduce jobs would time out.

At first I thought about improving the performance of erldis (the erlang Redis library). With the help of Jacob Perkins, the erldis maintainer, we identified where to improve performance. Basically it consisted of moving from strings/lists to binaries. Performance improved by about 50%.

Code available here

I also removed many useless checks on whether keys already belong to sets. I just write every time, since the data in that case is very short and redis has very good write performance.

But my code still had mapreduce failures.

It all came from a misconception about that Partition argument in the start function … I ignored it and I was wrong. All the riak_redis_backends would connect to the same key space and uselessly exchange information until timeout.

I tried to have partitions connect to different redis databases. Not good as a Redis server can only have 16 databases. [update : this is by default. Redis can be configured to handle way more in redis.conf]

So I prefixed the key names with node() and Partition.
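The prefixing idea can be sketched in a few lines. This is a Python illustration rather than the actual Erlang code, and the function name, the ":" separator and the "bucket/key" layout are assumptions made for the example, not what riak_redis_backend really uses.

```python
def prefixed_key(node: str, partition: int, bucket: str, key: str) -> str:
    """Build a Redis key name that is unique per Riak node and partition.

    The ':' separator and the bucket/key layout are illustrative assumptions.
    """
    return f"{node}:{partition}:{bucket}/{key}"


# Two partitions of the same node no longer share a key space.
first = prefixed_key("riak@127.0.0.1", 0, "users", "bob")
second = prefixed_key("riak@127.0.0.1", 1, "users", "bob")
```

With such a prefix, every backend instance can talk to the same Redis database without seeing keys that belong to another partition.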

Using the synchronous version of erldis certainly slowed things down a bit as well. A put/delete operation is four Redis operations :

– adding the bucket/key to the world set.

– setting the bucket/key to the binary_to_term‘ed Value.

– adding the key to the specific Bucket set.

– adding the Bucket to the buckets set.

An upcoming improvement will be to roll back if one of these operations fails (that’s an important one).

I sped things up by starting a process for each operation and waiting for the results. The four operations are done in parallel for better efficiency.
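The shape of that parallel put might look like the sketch below. It is written in Python with threads instead of Erlang processes, and everything in it is an assumption made for illustration: FakeRedis is a tiny in-memory stand-in for a Redis client, and the set names "world" and "buckets" as well as the prefix argument are hypothetical.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class FakeRedis:
    """Tiny in-memory stand-in for a Redis client (hypothetical, for illustration)."""

    def __init__(self):
        self._lock = threading.Lock()
        self.strings = {}
        self.sets = {}

    def set(self, key, value):
        with self._lock:
            self.strings[key] = value
        return True

    def sadd(self, name, member):
        with self._lock:
            self.sets.setdefault(name, set()).add(member)
        return True


def put(client, prefix, bucket, key, value):
    """Run the four writes of a put concurrently and wait for all of them."""
    bk = f"{bucket}/{key}"
    operations = [
        lambda: client.sadd(f"{prefix}world", bk),        # bucket/key into the world set
        lambda: client.set(f"{prefix}{bk}", value),       # the value itself
        lambda: client.sadd(f"{prefix}{bucket}", key),    # key into its bucket set
        lambda: client.sadd(f"{prefix}buckets", bucket),  # bucket into the buckets set
    ]
    with ThreadPoolExecutor(max_workers=len(operations)) as pool:
        futures = [pool.submit(op) for op in operations]
        results = [f.result() for f in futures]  # blocks until every write is done
    return all(results)


client = FakeRedis()
ok = put(client, "riak@127.0.0.1:0:", "users", "bob", b"some binary value")
```

Collecting every result before returning is also the natural place to hook a rollback: if any of the four results signals a failure, the caller knows the put is incomplete.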

Code is still a bit slower on insert/delete than DETS reference code, but consistently faster on mapreduce operations. (see the riak-playground escript)

The future

Will this code be useful ?

I hope it can help. Both Riak and Redis are great, and they complement each other well. Redis is very fast while Riak handles masterless replication, redundancy and mapreduce. So I do find them a great match.

For the time being the problem is that Redis is limited to RAM sized data sets. But it won’t last. antirez is committed to releasing a virtual memory version of redis this year.

So that should not be an issue soon.

And is it really a problem ? I see my code as mitigating this temporary issue !

I taught RDBMS for several years. I’m sorry Dr. Codd, but database systems never have been this fun.

riak_redis_backend is now passing all tests

See here !

I am a bit disappointed by insert times (twice as slow as dets), but I may have a plan.

It’s been great fun and I learnt a lot about both Riak and Redis.

The gist is here — though it now deserves to migrate to a full size github project.

Later, after foie gras, salmon and fine French wine I might dissect the approach to the code, and my plan !

Merry Christmas !

A redis backend for riak.

Awesome + Awesome = AWESOMER

Riak is AWESOME.

redis is AWESOME.

Why not marry them and keep the babies ?

riak_redis_backend does exactly that. Data stuffed into riak is stored on a redis running on localhost on the standard port. (Having non-local data would kind of miss the point, right ?)
An added trick is that the redis database index is different for each Riak node, in case you happen to run all your Riak nodes on the same machine.

Warnings :

Implementation sucks big time, and though it passes the riak_test_util:standard_backend_test, I made it time out on some MapRed jobs. And MapRed tasks also run slower than dets ones.

[Update] performance has been improved by using a better encoding (meaning I removed that base64 horror … fprof told me that). My test code went down from 25 seconds to 15 seconds.
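Part of the cost of a base64 step is easy to see from sizes alone: it emits four output bytes for every three input bytes, on top of the CPU time spent encoding and decoding. The sketch below is a Python illustration under assumptions: pickle stands in for Erlang's binary term serialisation, and the sample value is made up.

```python
import base64
import pickle

# A made-up value; pickle stands in for Erlang's binary term serialisation.
value = {"bucket": "users", "key": "bob", "payload": list(range(50))}

raw = pickle.dumps(value)        # binary serialisation, can be stored as-is
encoded = base64.b64encode(raw)  # the extra text-encoding step

# base64 emits 4 output bytes for every 3 input bytes (rounded up).
expected_size = 4 * ((len(raw) + 2) // 3)
```

Since Redis values are binary safe, the serialised bytes can be stored directly and the encoding step dropped.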

Can it be useful — one day ?

Not sure. Given that redis cannot currently have a dataset bigger than memory (but plans to have virtual memory), it kind of limits the data size.

Performance is currently worse than riak_dets_backend, but given that Redis is *fast*, one can hope for enhancements. I think that most of the improvement should come from the erldis library, which could use some tweaking to handle binaries better.

Where is it ?

It’s over there !

How can I use it ?

Install the erldis (binary branch) application in your lib/ Riak node directory.
Compile and hurl in the riak/ebin directory. Restart riak node, make sure redis is running, and off you go.

Seabeyond, XMPP Process One event

Yesterday was the day it snowed in Paris for the first time this season.

But it was not the only event. People from ten countries gathered in Paris to attend the SeaBeyond meetup.

And it was a good one.


There’s the official point of view just available.

But wait, here’s mine !


As I occasionally hack on ejabberd, meeting the devs is always good. Christophe, mod_pubsub’s maintainer, hosted a great discussion on the subject.

Among the subjects :
Pubsub performance, the usual questions … How fast is pubsub ? Should I use ODBC or mnesia ? Why two modules ?

How fast is Pubsub ?

As of ejabberd 2.1, many improvements are now implemented. But how fast depends on how you use pubsub. Many nodes, few subscribers ? Many subscribers ? What is the subscription rate ? How many items per node ?

The last_item_cache did a lot of good for performance, especially if you have a high user churn.

ODBC or Mnesia ?

Vast question, but if you’ve got many nodes and many, many items, you’re better off with ODBC.

Why two modules ?

There will never be a merge between ODBC and mnesia. ODBC has undergone many optimisations, limiting the number of queries (6 times fewer since 2.0). It’s too bad we won’t get a storage backend abstraction … maintaining my S3/SimpleDB version is still a bit of work, as would be pushing fancy nosql versions (riak ? redis backends ?), but it’s for better performance in each case.

There was more !

But I got sidetracked by an interesting discussion with Erlang Solutions‘s Mietek Bąk on Haskell — apologies to the rest of the guys on the pubsub tables, as we got quite enthusiastic and noisy … and put off our discussion until later.

Christophe said that, as version 3 of ejabberd would implement exmpp, one should get ready to rewrite one’s nodes and node_trees, but performance would get way better with exmpp.

Many people, many discussions

Talked with one of the Nokia guys, who told me about the difficulties of being Nokia when you try to innovate. You have to please 250 mobile operators, all with different opinions. Especially when you try to get around their old, abusively expensive business, as Nokia is trying with Ovi.
Also toyed a bit with the N900. Nice phone.

Talked with Sebastian Geib, freelance sysadmin from Berlin, about working in Berlin/Germany, compared to Paris/France.

Also learned about Meetic’s chat architecture (overpowered) and how erlang is viewed by sysadmins (not favorably by default :).

And presentations

About the admin panel for ejabberd, Jingle, BBC use of PEPanon pubsub on ejabberd, Yoono and Process One’s Wave Server.

BBC’s use of PEPanon pubsub can be seen here, in the topmost Flash.

Had to leave early and missed the Champagne and the Wave Server demo. But this talk by Mickaël Rémond was quite interesting. Quote of the day : “Google wants third party wave servers to be good but not too good.”

Next year

I’ll be back.

A strategy for testing ejabberd modules

I’ve always been looking for an elegant way of testing custom ejabberd modules.
I tried a couple of ways before but was never convinced. Running tests against a running ejabberd node, for example. But it’s not easy: many dependencies, and hard to set up. Or mocking modules such as ejabberd_router. But either I hit weird issues, or it was so cumbersome that I knew I’d never use it again.

But this time, I think I’ve got it.

Check out the cool combination of etap and erl_mock !

It’s on github with more blathering from yours truly.
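The general idea of mocking the router so that a module can be tested without a running server can be sketched as follows. This is a Python illustration using unittest.mock, not etap or erl_mock; the Router class, its route signature and echo_handler are hypothetical stand-ins invented for the example.

```python
from unittest import mock


class Router:
    """Stand-in for a routing component such as ejabberd_router (hypothetical API)."""

    def route(self, sender, recipient, packet):
        raise RuntimeError("needs a running server")


def echo_handler(router, sender, recipient, packet):
    """Module under test: sends every packet back to its sender."""
    router.route(recipient, sender, packet)


def test_echo_handler():
    # The mock replaces the real router, so no server is needed.
    router = mock.create_autospec(Router, instance=True)
    echo_handler(router, "alice@example.org", "echo.example.org", "<message/>")
    # The test only checks what the module asked the router to do.
    router.route.assert_called_once_with(
        "echo.example.org", "alice@example.org", "<message/>"
    )
    return True


passed = test_echo_handler()
```

The module under test never notices the difference, and the assertion is about the call it made rather than about any server state.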

Why fork the whole ejabberd tree ?

I had the question on PlanetErlang.

Why have you put whole ejabberd source to the repository? You could just put your modules to avoid constant merging from upstream.

Thank you, Anton, for enabling me to express some love to git and github.

The short answer

It’s easy and fun.

The longer answer

The early version of the code was actually in a separate private SVN repository. Part of my install procedure was copying the beams into the ejabberd ebin folder. But each time mod_muc or mod_pubsub modules were updated I had to launch FileMerge and merge things. And those modules are not slim.

Enter git and github. Brian J. Cully has a script that updates his ejabberd repository on github every hour from the Process One svn repository.

My own ejabberd repository is a fork of his.

And having my own tree up-to-date is only a matter of one (1) command :

“github pull bjc master“

Run sudo gem install github to install the github gem.

Merges are done automatically. Of course the occasional conflict may arise, but whatever the process, I cannot avoid it.

Pushing to my github repository is also one command :

“git push origin master“

And if I want to send a patch right up to Process One ?

Say for pubsub …

“git diff bjc/master -- src/mod_pubsub > pubsub.patch“

Contributing is easy

Fork my project, hack, push, pull request.

Can it be any simpler ? (This question is not rhetorical)

ejabberd “cloud edition alpha”


It’s an ejabberd-based proof-of-concept, with a set of custom modules aimed at making it stateless and very scalable on the storage backend.

All state data (including user accounts, roster information, persistent conference room, pubsub nodes and subscriptions) are stored in AWS webservices, S3 or SimpleDB.

It helps with scaling up and down, and keeps management costs proportional to usage. AWS services are very wide, and massively parallel access is what it’s all about.

Default ejabberd configuration uses mnesia, but Process One recommends switching some services like roster or auth to ODBC when load increases.

But DBMS have their own scaling problems, and that’s yet another piece of software to administer.

CouchDB seems loads of fun, and I’d like to put some effort into running ejabberd over it later on. Some work has started, but not much progress yet. (and CouchDB is still software that one needs to manage).

Current state

  • ejabberd_auth_sdb : stores users in SimpleDB. The version on github stores passwords encrypted, but forces PLAIN passwords over XMPP, which means that TLS is required (really !). I have a version somewhere which exchanges hashes on the wire but stores passwords in clear in SimpleDB. Your call.

  • mod_roster_sdb : roster information is stored in SimpleDB

  • mod_pubsub : nodetree data is stored in S3 along with items. Subscriptions are stored in SimpleDB. I reimplemented nodetree_default and node_default, which means that PEP works fine too.

  • mod_muc : Uses modular_muc with the S3 storage for persisting rooms.

  • mod_offline : S3 for storing offline messages

  • mod_last_sdb : Stores last activity in SimpleDB

Still lacking :

Next to each module name is where I think its data should be stored.

  • mod_shared_roster : in SimpleDB

  • mod_vcard : VCards in S3, index in SimpleDB

  • mod_private : S3

  • mod_privacy : S3

  • mod_muc_log : S3 (with a specific setting for direct serving, maybe)

These modules are the only ones which have state that should be persisted on disk. Mnesia is of course still used for routing and configuration, but that’s transient data.

Transactions and latency

We lose transactions by switching away from mnesia or ODBC. That may or may not be a problem. I think it won’t be, but I don’t have data to prove it one way or the other.

Latency also grows, but erlsdb and erls3, the libraries on which the modules are built, can interface with memcached (and are ketama enabled) if you use merle. Additionally using merle will keep usage costs down.
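The effect of putting a cache in front of a metered backend can be sketched as a read-through cache. This Python illustration makes several assumptions: a plain dict plays the role of memcached, the fetch callable stands in for an S3 or SimpleDB request, and the class and key names are invented for the example.

```python
class ReadThroughCache:
    """Serve reads from a local cache and hit the backend only on a miss."""

    def __init__(self, fetch):
        self._fetch = fetch     # callable standing in for an S3/SimpleDB request
        self._cache = {}        # a dict plays the role of memcached here
        self.backend_calls = 0  # every backend call costs latency and money

    def get(self, key):
        if key in self._cache:
            return self._cache[key]
        self.backend_calls += 1
        value = self._fetch(key)
        self._cache[key] = value
        return value


# Hypothetical backend content, for illustration only.
backend = {"roster/alice": "<roster/>"}
cache = ReadThroughCache(backend.__getitem__)
first_read = cache.get("roster/alice")   # miss: goes to the backend
second_read = cache.get("roster/alice")  # hit: served from the cache
```

Every hit is one request that is neither waited for nor billed, which is where both the latency and the cost savings come from. A real setup would also need invalidation on writes, which this sketch leaves out.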

ejabberd mod_pubsub underwent several optimizations recently, and that improved performance of non-memcached AWS mod_pubsub. Initial code had latency around 10 seconds between publishing and receiving the event. Since last week’s improvement, performance is much better.

Down the road

I’d like to see an EC2 AMI based on this code: just pass the domain name or the ejabberd.cfg file to ec2-start-instance and boom ! you have an ejabberd server up and running.

Want more horse power ? Start another one on the same domain in the same EC2 security group, the ejabberd nodes autodiscover each other and you’ve got a cluster. ec2nodefinder is designed for this use.

Combined with the very neat upcoming load-balancing and autoscaling services from Amazon Web Services, there’s a great opportunity for deploying big and cheap!

Alternatives to the AWS loadbalancing would be pen, or a “native” XMPP solution.

A few things would need to be implemented for this to work well, like XMPP fast reconnect via resumption and/or C2S/S2S process migration between servers, because scaling down is as important as scaling up in the cloud.

If you want to participate, you’d be very welcome. Porting the modules I did not write, or testing and sending feedback would be … lovely.

And of course if Process One wants to integrate this code in a way or another, that would also be lovely !

Get it

Get it, clone it, fork it ! There’s a bit of documentation on the README page.

[edited : added links to XEP-0198 and rfc3920bis-08, thanks to Zsombor Szabó for pointing me to them]

erlsdb and erls3 use ibrowse

I had some issues with inets under heavy load with erlsdb and erls3.

And when you are talking to Amazon Web Services, you’d want to write in parallel as much as possible. You also want to pipeline requests in one single socket, especially while using SSL encryption (even more costly to establish).

ibrowse seemed very interesting, especially since the CouchDB project started using it !

Got it out of jungerl, which is always a bit of a pain. You can also find it on github, I figured later.

Porting my code to ibrowse was quite easy. Though I had to change a bit of the async code. Instead of sending one message once the inets process received the HTTP response, it sends a message upon receiving headers then a slew of messages for each chunk it receives.
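The change on the receiving side can be pictured as a loop that assembles one response from several messages. This Python sketch uses a queue as the mailbox; the message tags ("headers", "chunk", "end") are hypothetical stand-ins rather than ibrowse's actual message names, and the explicit end-of-response message is an assumption of the example.

```python
import queue


def collect_response(mailbox):
    """Assemble one HTTP response from a headers message followed by body chunks."""
    status, headers, chunks = None, None, []
    while True:
        message = mailbox.get()
        tag = message[0]
        if tag == "headers":
            _, status, headers = message
        elif tag == "chunk":
            chunks.append(message[1])
        elif tag == "end":
            return status, headers, b"".join(chunks)
        else:
            raise ValueError(f"unexpected message: {message!r}")


# Simulate what an asynchronous HTTP client might deliver for one request.
mailbox = queue.Queue()
mailbox.put(("headers", 200, {"Content-Type": "text/xml"}))
mailbox.put(("chunk", b"<ListBucket"))
mailbox.put(("chunk", b"Result/>"))
mailbox.put(("end",))
status, headers, body = collect_response(mailbox)
```

Compared with a single "response is complete" message, the caller has to keep a little state between messages, but it also gets the chance to act on each chunk as it arrives.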

Had a few Too Many Open Files errors while loadtesting. As it appears, I had over 500 connections open to Amazon AWS. I set more sensible defaults and the problem went away.

Configuration is by host, that forced me to change the naming of the S3 buckets from to

One caveat : accessing SimpleDB using SSL gives InvalidSignature errors for the time being. Will squash that soon.

Using ibrowse will also enable me to write a client to S3 that will stream files to and from disk.

The ibrowse versions are in the ibrowse branch for both projects.

erls3 : OTP application for accessing S3

Just committed erls3 over to github.

Enables access to S3. Tailored for highly concurrent access with small items rather than sending multigigabyte items. Everything you get/send from/to S3 is stored in the VM.

Usage examples will come shortly.

The API is however very straightforward.

Erlang SimpleDB application


SimpleDB is the cloud database by Amazon Web Services.

Still in beta, SimpleDB provides you with metered access to fat storage designed for an internet scale database. Compared to MySQL or other RDBMS, it has few features (no transactions, no joins …), but using it is a no-brainer.

Still, having a library wrapping the HTTP calls to SimpleDB is good.


Hence the erlsdb OTP app. However, development seemed to have stopped. So I took it to github and hacked on it.

It went surprisingly quickly (most certainly due more to erlang’s power than to my own skills) as I managed to add async http calls and multiple workers, and finished implementing the API in a few hours.

Still needs a bit of polish but it’s already waiting for feedback !

Get it here !


(if your eyes don’t burn from the syntax coloring)