This paper describes wide-area distributed file system.
It follows POSIX and developers could control various trade-offs.
As application runs on multiple datacenter emerges, underlying infrastructure that is easy to use would be crucial. And filesystem is one of the most important systems. WheelFS looks suitable for this end as it offers standard interface and flexiblility as well. More systems supporting multiple datacenter concept will come, such as storage, database, processing, etc...
As there are many trade-off choices and it is application specific in many cases, giving developers an avility to specify cues would be a natural direction to go. However, having too many options often ends up with just using default. Sometimes, provided cues turns out to be inappropriate. It would be great if there is a way to make a smart decision based on profiling or something like that.
Monday, April 20, 2009
Scalling Out
Having multiple datacenters is becoming more important, both for better quality and fault tolerance. This article illustrate needs & problems of expanding a web service over multiple datacenters.
To solve cache consistency problem, they hacked MySQL to know when and what data is replicated - cache item is updated (actually deleted) when it is replicated.
Another problem they had was routing problem. As they allowed writing only on California databases, the user written to the database should remain on California until the entry is replicated to other datacenters to avoid confusion - and they picked 20 seconds for it.
Facebook's solutions looks practical enough, but far from general or formal solution. What if 3rd datacenter is coming? etc. Their solution only works for their settings and applications.
To solve cache consistency problem, they hacked MySQL to know when and what data is replicated - cache item is updated (actually deleted) when it is replicated.
Another problem they had was routing problem. As they allowed writing only on California databases, the user written to the database should remain on California until the entry is replicated to other datacenters to avoid confusion - and they picked 20 seconds for it.
Facebook's solutions looks practical enough, but far from general or formal solution. What if 3rd datacenter is coming? etc. Their solution only works for their settings and applications.
Wednesday, April 1, 2009
Erlang
Erlang is a language developed at Ericsson.
It actually include a concept of virtual machine just like Java, to support populating process.
This language takes massive parallelism and failure in mind from the beginning.
As a result, it is a good fit to large scale distributed system.
It is proven useful, as some of ericsson product and other open-source projects were written in Erlang and work well.
However, people complain about its syntax - As a functional language, the syntax of Erlang is not that easy compared to traditional imperative languages. To be a functional language seems inevitable to prevent side-effect which is bad for concurrent program, but it still keep ordinary programmers from using it.
It actually include a concept of virtual machine just like Java, to support populating process.
This language takes massive parallelism and failure in mind from the beginning.
As a result, it is a good fit to large scale distributed system.
It is proven useful, as some of ericsson product and other open-source projects were written in Erlang and work well.
However, people complain about its syntax - As a functional language, the syntax of Erlang is not that easy compared to traditional imperative languages. To be a functional language seems inevitable to prevent side-effect which is bad for concurrent program, but it still keep ordinary programmers from using it.
Wednesday, March 18, 2009
Scribe
Scribe is an open-source distributed logging system developed at Facebook.
It is designed to handle daunting amount of log, say 10 billion messages a day.
This logging system has three components:
1. Client Code interface - Thrift (generates glue-code for various languages), Category & Message
2. Distribution System - Scribe process runs on every node, it forward log message according to configuration file. (classified by category)
3. Do Something Usefullizer - log file? datawarehouse? HDFS/Hadoop?
claim: if your backend is scalable, why bother distributed logging system? (-> Chukwa)
Their major design decision is as following:
1. Don't assume a particular network topology: easily configurable
2. Reliability : reliable enough that we can expect to get all of the data almost all of the time, but not reliable enough to require heavyweight protocols and disk usage.
3. Simple Data model : a category and the actual message.
Simple, Unified, Scalable component is useful
It is designed to handle daunting amount of log, say 10 billion messages a day.
This logging system has three components:
1. Client Code interface - Thrift (generates glue-code for various languages), Category & Message
2. Distribution System - Scribe process runs on every node, it forward log message according to configuration file. (classified by category)
3. Do Something Usefullizer - log file? datawarehouse? HDFS/Hadoop?
claim: if your backend is scalable, why bother distributed logging system? (-> Chukwa)
Their major design decision is as following:
1. Don't assume a particular network topology: easily configurable
2. Reliability : reliable enough that we can expect to get all of the data almost all of the time, but not reliable enough to require heavyweight protocols and disk usage.
3. Simple Data model : a category and the actual message.
Simple, Unified, Scalable component is useful
Sunday, March 15, 2009
AWS
Amazon Web Service is consist of various components for cloud computing - EC2 for computing resources, S3 for storage, SQS for queuing and SimpleDB for structured storage.
AWS is not the first comer, but actually made cloud computing accessible and viable - technically and economically.
Definitely it made huge impact on the area. Everybody talking about cloud computing refer to AWS.
Combination of S3 and EC2 (and especially the fact that they don't charge for the transfer between S3 and EC2) makes AWS attractive. Simplicity of AWS is another attractive feature. It is rather primitive (= low level service) and doesn't offer that much, but users of AWS (who is a kind of early adaptor of cloud computing) cloud don't have to worry too much about locking-in.
AWS is not the first comer, but actually made cloud computing accessible and viable - technically and economically.
Definitely it made huge impact on the area. Everybody talking about cloud computing refer to AWS.
Combination of S3 and EC2 (and especially the fact that they don't charge for the transfer between S3 and EC2) makes AWS attractive. Simplicity of AWS is another attractive feature. It is rather primitive (= low level service) and doesn't offer that much, but users of AWS (who is a kind of early adaptor of cloud computing) cloud don't have to worry too much about locking-in.
Google Apps
When talking about cloud computing, it often ends up with debating security - is cloud really safe?
One argues that relying on a cloud provider and hand one's important data (or fate) to it is dangerous. The other points out that a cloud provider could do better in terms of security, especially compared to small or medium organizations.
Both are true, and the key is to increase security level of cloud computing and to convince perspective customer (= cloud user) that the cloud infrastructure is safe enough, or at least there is fall back plan if something went wrong.
One argues that relying on a cloud provider and hand one's important data (or fate) to it is dangerous. The other points out that a cloud provider could do better in terms of security, especially compared to small or medium organizations.
Both are true, and the key is to increase security level of cloud computing and to convince perspective customer (= cloud user) that the cloud infrastructure is safe enough, or at least there is fall back plan if something went wrong.
Azure
Azure is set of cloud platform/service from Microsoft.
Developers write a .net based code to run on Microsoft's cloud infrastructure. The program may also use various cloud services such as SQL and Live services.
Compared to AWS, Azure is value-added cloud infrastructure. In other words, it offers more than just providing bare virtual machine. But at the same time, it makes hard to migrate to other services. If there is only one service provider, users would be more hesitant. Will Microsoft allow 3rd party provider to proliferate the Azure platform? Or keep it proprietary as they always did?
Developers write a .net based code to run on Microsoft's cloud infrastructure. The program may also use various cloud services such as SQL and Live services.
Compared to AWS, Azure is value-added cloud infrastructure. In other words, it offers more than just providing bare virtual machine. But at the same time, it makes hard to migrate to other services. If there is only one service provider, users would be more hesitant. Will Microsoft allow 3rd party provider to proliferate the Azure platform? Or keep it proprietary as they always did?
Monday, March 2, 2009
Pig Latin
Pig-Latin is a query language runs on top of Hadoop.
Even though Hadoop or MapReduce style programming framework is good to write a program to analyze large dataset, it is rigid and low-level. Pig-Latin aims to privode means to write ad-hoc analysis queries. This should be quite influential, as the more it is easy to use, the more users will use it. Actually there are a few different approaches aims the same end - Sawzal, DryadLINQ, etc. Each approach has its own characteristics and pros/cons. It would be interesting to see which one will win this area. To me, Pig Latin is seems closest to SQL, which is widely adopted language. If the performance of Pig (an execution stack of Pig-Latin) is good enough and Pig-Latin could privode with flexible (and easy to use) grammar, combined with the fact that this is an open-source project to which many prospective users have access, Pig would win a significant portion.
Even though Hadoop or MapReduce style programming framework is good to write a program to analyze large dataset, it is rigid and low-level. Pig-Latin aims to privode means to write ad-hoc analysis queries. This should be quite influential, as the more it is easy to use, the more users will use it. Actually there are a few different approaches aims the same end - Sawzal, DryadLINQ, etc. Each approach has its own characteristics and pros/cons. It would be interesting to see which one will win this area. To me, Pig Latin is seems closest to SQL, which is widely adopted language. If the performance of Pig (an execution stack of Pig-Latin) is good enough and Pig-Latin could privode with flexible (and easy to use) grammar, combined with the fact that this is an open-source project to which many prospective users have access, Pig would win a significant portion.
Wednesday, February 25, 2009
HIVE
Hive is an data warehousing system developed by Facebook.
On top of Hadoop, Hive enables developers (or analysist) to write quick query to examine large dataset.
Even though the features of Hive is relatively simple compared with heavier language (e.g. Pig-Latin, DryadLINQ, etc...), it seems working in many cases. If a user need more sophisticated analysis, he/she could write his/her own MapReduce code. If it is a simple task, it is better to use Hive to run that query - Hive is good fit to this scenario. This is a 'good enough' system for many cases, and a stepping stone to more high-profile languages, which would take time to mature.
On top of Hadoop, Hive enables developers (or analysist) to write quick query to examine large dataset.
Even though the features of Hive is relatively simple compared with heavier language (e.g. Pig-Latin, DryadLINQ, etc...), it seems working in many cases. If a user need more sophisticated analysis, he/she could write his/her own MapReduce code. If it is a simple task, it is better to use Hive to run that query - Hive is good fit to this scenario. This is a 'good enough' system for many cases, and a stepping stone to more high-profile languages, which would take time to mature.
Wednesday, February 18, 2009
Dynamo
This paper describes Dynamo, Amazon's highly available Key-value store.
As the title says, Dynamo is a simple key-value store, runs on large number of servers, takes failure into account as a normal event, and makes trade-off for availability.
As we've seen so far, lots of traditional data store don't work well in super-large configuration. GFS and BigTable represents google's response to the issue, and Dynamo is Amazon's.
The most interesting point is that they weight write over read - usually the opposite is the case.
This is due to how Amazon's main application - e-commerce - works.
Another interesting point is emphasis on SLA and its representation - 99.9%.
Dynamo is highly engineered toward meeting Amazon's needs.
Even though Dynamo let users choose some trade-off, but there are already a lot of assumption regarding workload characteristic. (Of course, this is the case with GFS and BigTable)
I guess that it is time for another general solution to come. (SCADS?)
As the title says, Dynamo is a simple key-value store, runs on large number of servers, takes failure into account as a normal event, and makes trade-off for availability.
As we've seen so far, lots of traditional data store don't work well in super-large configuration. GFS and BigTable represents google's response to the issue, and Dynamo is Amazon's.
The most interesting point is that they weight write over read - usually the opposite is the case.
This is due to how Amazon's main application - e-commerce - works.
Another interesting point is emphasis on SLA and its representation - 99.9%.
Dynamo is highly engineered toward meeting Amazon's needs.
Even though Dynamo let users choose some trade-off, but there are already a lot of assumption regarding workload characteristic. (Of course, this is the case with GFS and BigTable)
I guess that it is time for another general solution to come. (SCADS?)
Wednesday, February 11, 2009
Chubby
Chubby offer distributed lock system, to be used by other services.
Most interesting part of this work is, Chubby is out of practical reason - programmers are familiar with locking, and often retroactively add scale-out feature, so distributed lock system is more suitable than something like Paxos library.
This implies an important lesson. Large-scale distributed system evolves over long time period, and lots of programmers (with different skill level) write code for it. This means that 'easy-to-use' is often more essential than super-efficiency or elegant implementation.
Also, this kind of 'infrastructure service' for providing various services will be more prevalent, as this is something like 'middle ware for cloud'!
Most interesting part of this work is, Chubby is out of practical reason - programmers are familiar with locking, and often retroactively add scale-out feature, so distributed lock system is more suitable than something like Paxos library.
This implies an important lesson. Large-scale distributed system evolves over long time period, and lots of programmers (with different skill level) write code for it. This means that 'easy-to-use' is often more essential than super-efficiency or elegant implementation.
Also, this kind of 'infrastructure service' for providing various services will be more prevalent, as this is something like 'middle ware for cloud'!
BigTable
BigTable is simplified structured data storage for large distributed system.
As opposite to google file system which is mainly for write-once read-many data, BigTable supports more frequent update. And this is especially engineered toward google's specific needs. For example, string is the only data supported.
This is google's answer to their persistant data storage needs, which is not supported by any existing databases. As system scales, there are many applications (or services) that traditional database system doesn't consider. Or, sometimes database system seems too general to be used for specific purpose. Trends are going backward, to build own storage system, at least for very-large services.
As opposite to google file system which is mainly for write-once read-many data, BigTable supports more frequent update. And this is especially engineered toward google's specific needs. For example, string is the only data supported.
This is google's answer to their persistant data storage needs, which is not supported by any existing databases. As system scales, there are many applications (or services) that traditional database system doesn't consider. Or, sometimes database system seems too general to be used for specific purpose. Trends are going backward, to build own storage system, at least for very-large services.
Google File System
Google file system solves google's need to 1) store large volume 2) on commodity hardwares.
Before Google file system (or even now), people usually purchased expensive storage system (NAS, SAN, etc...) to store large volume of data safely. But this is costly, and inefficient in terms of resource utilization, as there are many idle storage space on commodity servers. Google came up with the way to utilize it.
Their design is based on practical reasons. Having single master is a good reason - this is definitely single point of failure, but they chose ease of management and simplicity over robust but complex solution. At least, it is useful enough in their usage pattern.
Also, they have made several trade-off choice to support 1st class customer better - write once read many data. Which implies that many other storage systems should appear to support other needs. BigTable is a good example. This leaves a question - The era of general solution (as database) has gone? Is it way better to devise different system for each need?
Before Google file system (or even now), people usually purchased expensive storage system (NAS, SAN, etc...) to store large volume of data safely. But this is costly, and inefficient in terms of resource utilization, as there are many idle storage space on commodity servers. Google came up with the way to utilize it.
Their design is based on practical reasons. Having single master is a good reason - this is definitely single point of failure, but they chose ease of management and simplicity over robust but complex solution. At least, it is useful enough in their usage pattern.
Also, they have made several trade-off choice to support 1st class customer better - write once read many data. Which implies that many other storage systems should appear to support other needs. BigTable is a good example. This leaves a question - The era of general solution (as database) has gone? Is it way better to devise different system for each need?
Monday, February 9, 2009
eBay's Scaling Odyssey & All I Need is Scale!
Those two slides presents eBay's principles & practices on distributed computing and cloud computing.
Principles described there are classical, but it is important to put them together and keep it in mind whenever we design systems to be scale out, like eBay. This slides are valuable as those are from the actual experience! However, it were more interesting if it included more detailed description how they did various trade-off, as that is easy to say but hard to generalize, and everything is case by case.
To them, it should be natural to see opportunity in cloud computing, for economic and operational reason. As their architecture is already ready-to-scale, it should be relatively easy to adopt cloud computing. To do so, they would like to figure out the difference between internal DC / internal cloud / 3rd party clouds, but it is not clear yet what the dominant 3rd party clouds will look like.
Principles described there are classical, but it is important to put them together and keep it in mind whenever we design systems to be scale out, like eBay. This slides are valuable as those are from the actual experience! However, it were more interesting if it included more detailed description how they did various trade-off, as that is easy to say but hard to generalize, and everything is case by case.
To them, it should be natural to see opportunity in cloud computing, for economic and operational reason. As their architecture is already ready-to-scale, it should be relatively easy to adopt cloud computing. To do so, they would like to figure out the difference between internal DC / internal cloud / 3rd party clouds, but it is not clear yet what the dominant 3rd party clouds will look like.
Wednesday, February 4, 2009
A Policy-aware Switching Layer for Data Centers
This paper tries to make middle-box the 1st class citizen in data center environment.
As the network configuration in a data center is getting more complex, making sure all packets going through middle box is getting harder. This paper introduces a programmable switch to apply middle-box policy. This will introduce some overhead at switch level, but operational benefit will be large enough to justify it. As operational cost of data center is becoming increasingly significant, this kind effort to make it simple & effective will get more attention.
As the network configuration in a data center is getting more complex, making sure all packets going through middle box is getting harder. This paper introduces a programmable switch to apply middle-box policy. This will introduce some overhead at switch level, but operational benefit will be large enough to justify it. As operational cost of data center is becoming increasingly significant, this kind effort to make it simple & effective will get more attention.
DCell: A Scalable and Fault-Tolerant Network Structure for Data
This paper tries to solve similar problem of the previous paper.
It looks similar to hypercube approach in supercomputing world.
DCell introduces complexity in wiring, much more than fat-tree approach.
Even an addition of single computer is not simple.
We can't just add a server or a rack and plug it in.
We should always consider topology.
Also this makes it hard to determine locality.
It looks similar to hypercube approach in supercomputing world.
DCell introduces complexity in wiring, much more than fat-tree approach.
Even an addition of single computer is not simple.
We can't just add a server or a rack and plug it in.
We should always consider topology.
Also this makes it hard to determine locality.
A Scalable, Commodity Data Center Network Architecture
This paper describes fat-tree topology to utilize commodity switch in data center networking.
As the price of high-end switch is too expensive, many people tried to use commodity hardware to reduce cost. Also, this tries to increase bisection bandwidth, which is becoming more important due to MapReduce type applications. But it comes not for free. Especially, this makes load balancing and wiring more complex, in turn this solution impractical.
As the price of high-end switch is too expensive, many people tried to use commodity hardware to reduce cost. Also, this tries to increase bisection bandwidth, which is becoming more important due to MapReduce type applications. But it comes not for free. Especially, this makes load balancing and wiring more complex, in turn this solution impractical.
Monday, February 2, 2009
Designing a highly availabile directory service
Designing a highly availabile directory service
As data center failure is common, it is important to have software system to cope with it. Basic strategy is making replication to fail over.
An interesting point regarding this article is replication & recovery scenario for different configuration - in case of 1, 2, 3, 5 datacenter.
More replication makes the system more reliable, and more complicated - even in this kind of imaginary scenario.
Real-life scenario might be far more complex, and need more manual fine-tune to achieve adequate performance & reliability as topology & capability of datacenters & links might be diverse.
It seems hard to generalize the solution.
As data center failure is common, it is important to have software system to cope with it. Basic strategy is making replication to fail over.
An interesting point regarding this article is replication & recovery scenario for different configuration - in case of 1, 2, 3, 5 datacenter.
More replication makes the system more reliable, and more complicated - even in this kind of imaginary scenario.
Real-life scenario might be far more complex, and need more manual fine-tune to achieve adequate performance & reliability as topology & capability of datacenters & links might be diverse.
It seems hard to generalize the solution.
Failure Trends in a Large Disk Drive Population
Failure Trends in a Large Disk Drive Population
The virtue of this paper : REAL numbers.
As the size of system is growing, only a handful of large company is able to see what's going on with that size. And this makes the life of researchers without the access to large resources harder. Cloud computing would help in conducting experiment in larger scale & make it more convincing.
This paper shows some interesting statistics on disk driver failure. It is not surprising that the failure is strongly correlated to model number, but this suggests that the trends could change as technology evolves. It would be interesting to see what will look like with SSD, which has different characteristics from HDD. In case of SSD, we may want to look more, e.g. writing pattern / locality, as SSD is worning out.
The virtue of this paper : REAL numbers.
As the size of system is growing, only a handful of large company is able to see what's going on with that size. And this makes the life of researchers without the access to large resources harder. Cloud computing would help in conducting experiment in larger scale & make it more convincing.
This paper shows some interesting statistics on disk driver failure. It is not surprising that the failure is strongly correlated to model number, but this suggests that the trends could change as technology evolves. It would be interesting to see what will look like with SSD, which has different characteristics from HDD. In case of SSD, we may want to look more, e.g. writing pattern / locality, as SSD is worning out.
Failure stories
- Crash: Data Center Horror Stories
- Data Center Failure As A Learning Experience
- Generator Failures Caused 365 Main Outage
Data center experts should spend more time to think about 'unthinkable' failure scenario to make it more reliable.
Subscribe to:
Posts (Atom)