Wednesday, March 18, 2009

Scribe

Scribe is an open-source distributed logging system developed at Facebook.

It is designed to handle a daunting volume of logs, say, 10 billion messages a day.

This logging system has three components:

1. Client code interface - Thrift (generates glue code for various languages); each log entry is a category and a message
2. Distribution system - a Scribe process runs on every node and forwards log messages according to a configuration file (routed by category)
3. Do Something Usefullizer - log files? data warehouse? HDFS/Hadoop?
Claim: if your backend is already scalable, why bother with a separate distributed logging system? (-> Chukwa)

Their major design decisions are as follows:
1. Don't assume a particular network topology: it should be easily configurable.
2. Reliability: reliable enough that we can expect to get all of the data almost all of the time, but not so reliable as to require heavyweight protocols and disk usage.
3. Simple data model: a category and the actual message.
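
To make the data model and category-based forwarding concrete, here is a minimal sketch in Python. It is not Scribe's Thrift interface or its configuration format: `LogEntry`, `ListSink`, and `Forwarder` are hypothetical names, and the in-memory retry queue is only an assumption about what "reliable enough" without a heavyweight protocol might look like.

```python
from collections import defaultdict, deque
from dataclasses import dataclass


@dataclass(frozen=True)
class LogEntry:
    """The whole data model: a category and the actual message."""
    category: str
    message: str


class ListSink:
    """Hypothetical destination that keeps received entries in memory."""

    def __init__(self):
        self.available = True
        self.received = []

    def write(self, entry):
        if not self.available:
            raise ConnectionError("sink unavailable")
        self.received.append(entry)


class Forwarder:
    """Hypothetical per-node process that routes entries by category."""

    def __init__(self, routes, default_sink):
        self.routes = routes              # category -> sink (stands in for the config file)
        self.default_sink = default_sink  # used for categories with no explicit route
        self.pending = defaultdict(deque) # per-category entries not yet delivered

    def log(self, category, message):
        self.pending[category].append(LogEntry(category, message))
        self.flush(category)

    def flush(self, category):
        """Try to deliver queued entries in order; return True if the queue drained."""
        sink = self.routes.get(category, self.default_sink)
        queue = self.pending[category]
        while queue:
            try:
                sink.write(queue[0])
            except ConnectionError:
                return False  # keep the entries and retry on a later flush
            queue.popleft()
        return True


web_sink, catch_all = ListSink(), ListSink()
forwarder = Forwarder({"web": web_sink}, catch_all)
forwarder.log("web", "GET /index.html")
forwarder.log("ads", "click id=7")  # no route configured, goes to catch_all
```

Entries queued while a sink is down are lost if the process dies, which is the kind of trade-off design decision 2 accepts.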

Takeaway: a simple, unified, scalable component is useful.

Sunday, March 15, 2009

AWS

Amazon Web Services consists of various components for cloud computing - EC2 for computing resources, S3 for storage, SQS for queuing, and SimpleDB for structured storage.

AWS was not the first comer, but it actually made cloud computing accessible and viable - technically and economically.
It has definitely made a huge impact on the area. Everybody talking about cloud computing refers to AWS.

The combination of S3 and EC2 (and especially the fact that they don't charge for the transfer between S3 and EC2) makes AWS attractive. The simplicity of AWS is another attractive feature. It is rather primitive (= a low-level service) and doesn't offer that much, but for the same reason users of AWS (who are early adopters of cloud computing) don't have to worry too much about lock-in.

Google Apps

Talk about cloud computing often ends up as a debate about security - is the cloud really safe?
One side argues that relying on a cloud provider and handing one's important data (or fate) to it is dangerous. The other points out that a cloud provider could do better in terms of security, especially compared to small or medium organizations.
Both are true. The key is to raise the security level of cloud computing and to convince prospective customers (= cloud users) that the cloud infrastructure is safe enough, or at least that there is a fallback plan if something goes wrong.

Azure

Azure is a set of cloud platforms/services from Microsoft.
Developers write .NET-based code to run on Microsoft's cloud infrastructure. The program may also use various cloud services such as SQL and Live services.

Compared to AWS, Azure is a value-added cloud infrastructure. In other words, it offers more than just a bare virtual machine. But at the same time, that makes it hard to migrate to other services. If there is only one service provider, users will be more hesitant. Will Microsoft allow third-party providers to proliferate the Azure platform? Or keep it proprietary as they always have?

Monday, March 2, 2009

Pig Latin

Pig Latin is a query language that runs on top of Hadoop.
Even though Hadoop or a MapReduce-style programming framework is good for writing programs that analyze large datasets, it is rigid and low-level. Pig Latin aims to provide a means to write ad-hoc analysis queries. This should be quite influential: the easier it is to use, the more users will use it. Actually, there are a few different approaches that aim at the same end - Sawzall, DryadLINQ, etc. Each approach has its own characteristics and pros/cons. It will be interesting to see which one wins this area. To me, Pig Latin seems closest to SQL, which is a widely adopted language. If the performance of Pig (the execution stack for Pig Latin) is good enough and Pig Latin can provide a flexible (and easy to use) grammar, then, combined with the fact that this is an open-source project to which many prospective users have access, Pig could win a significant share.
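
To illustrate what "rigid and low-level" means here, below is a small single-process Python sketch of a MapReduce-style count of visits per URL. It is only a stand-in for the programming model, not Hadoop's API; the function names and the sample records are made up for illustration. Even this trivial analysis has to be split into a map step, a grouping step, and a reduce step, whereas in Pig Latin the same query is roughly a LOAD, a GROUP ... BY, and a FOREACH ... GENERATE with COUNT.

```python
from collections import defaultdict


def map_phase(records):
    """Emit a (url, 1) pair for every (user, url) visit record."""
    for _user, url in records:
        yield url, 1


def shuffle(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}


def count_visits(records):
    return reduce_phase(shuffle(map_phase(records)))


# Hypothetical sample data: (user, url) pairs.
visits = [("amy", "a.com"), ("bob", "b.com"), ("amy", "a.com")]
counts = count_visits(visits)
```

Changing the question (say, counting per user instead of per URL) means editing the map function rather than one clause of a query, which is the gap an ad-hoc query language tries to close.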