There is one question I hear every time I make a presentation about Hadoop to an audience of DBAs. This question was also recently asked in LinkedIn’s DBA Manager forum, so I finally decided to answer it in writing, once and for all.
“As we all see there are lot of things happening on Big Data using Hadoop etc….
Can you let me know where do normal DBAs like fit in this :
DBAs supporting normal OLTP databases using Oracle, SQL Server databases
DBAs who support day to day issues in Datawarehouse environments .
Do DBAs need to learn Java (or) Storage Admin ( like SAN technology ) to get into Big Data ? ”
I hear a few questions here:
- Do DBAs have a place at all in the Big Data and Hadoop world? If so, what is that place?
- Do they need new skills? Which ones?
Let me start by introducing everyone to a new role that now exists in many organizations: Hadoop Cluster Administrator.
Organizations that have not yet adopted Hadoop sometimes imagine Hadoop as a developer-only system. I think this is the reason why I get so many questions about whether or not we need to learn Java every time I mention Hadoop. Even within Pythian, when I first introduced the idea of Hadoop services, my managers asked whether we would need to learn Java or hire developers.
Organizations that have adopted Hadoop found out that any production cluster larger than 20-30 nodes requires a full-time admin. This admin’s job is surprisingly similar to a DBA’s job – the admin is responsible for the performance and availability of the cluster, the data it contains, and the jobs that run there. The list of tasks is almost endless and also strangely familiar – deployment, upgrades, troubleshooting, configuration, tuning, job management, installing tools, architecting processes, monitoring, backups, recovery, etc.
I have not seen a single organization with a production Hadoop cluster that didn’t have a full-time admin, but if you don’t believe me – note that Cloudera is offering Hadoop Administrator Certification and that O’Reilly is selling a book called “Hadoop Operations”.
So you are going to need a Hadoop admin.
Who are the candidates for the position? The best option is to hire an experienced Hadoop admin. In 2-3 years, no one will even consider doing anything else. But right now there is an extreme shortage of Hadoop admins, so we need to consider less perfect candidates. The usual suspects tend to be: junior Java developers, sysadmins, storage admins, and DBAs.
Junior Java developers tend not to do well in a cluster admin role, just like PL/SQL developers rarely make good DBAs. Operations and dev are two different career paths that tend to attract different types of personalities.
When we get to the operations personnel, storage admins are usually out of consideration because their skillset is too unique and valuable to other parts of the organization. I’ve never seen a storage admin who became a Hadoop admin, or any place where it was even seriously considered.
I’ve seen both DBAs and sysadmins becoming excellent Hadoop admins. In my highly biased opinion, DBAs have some advantages:
- Everyone knows DBA stands for “Default Blame Acceptor”. Since the database is always blamed, DBAs typically have great troubleshooting skills, processes, and instincts. All of these are critical for good cluster admins.
- DBAs are used to managing systems with millions of knobs to turn, all of which have a critical impact on the performance and availability of the system. Hadoop is similar to databases in this sense – tons of configurations to fine-tune.
- DBAs, much more than sysadmins, are highly skilled in keeping developers in check and making sure no one accidentally causes critical performance issues on an entire system. This skill is critical when managing Hadoop clusters.
- DBA experience with DWH (especially Exadata) is very valuable. There are many similarities between DWH workloads and Hadoop workloads, and similar principles guide the management of the system.
- DBAs tend to be really good at writing their own monitoring jobs when needed. Every production database system I’ve seen has a crontab file full of customized monitors and maintenance jobs. This skill continues to be critical for Hadoop systems.
To be fair, sysadmins also have important advantages:
- They typically have more experience managing huge numbers of machines (much more so than DBAs).
- They have experience working with configuration management and deployment tools (Puppet, Chef), which is absolutely critical when managing large clusters.
- They can feel more comfortable digging in the OS and network when configuring and troubleshooting systems, which is an important part of Hadoop administration.
Note that in both cases I’m talking about good, experienced admins – not those who can just click their way through the UI, but those who really understand their systems and much of what is going on outside the specific system they are responsible for. You need DBAs who care about the OS, who understand how hardware choices impact performance, and who understand workload characteristics and how to tune for them.
There is another important role for DBAs in the Hadoop world: Hadoop jobs often get data from databases or output data to databases. Good DBAs are very useful in making sure this doesn’t cause issues. (Even small Hadoop clusters can easily bring down an Oracle database by starting too many full-table scans at once.) In this role, the DBA doesn’t need to be part of the Hadoop team as long as there is good communication between the DBA and Hadoop developers and admins.
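To make the full-table-scan risk concrete, here is a minimal sketch of one way to cap how many extracts hit the database at the same time. It is an illustration only, not how any particular Hadoop tool does it: `ScanThrottle`, `fake_extract`, `extract_all`, and the limit of 2 are all hypothetical names and values, and the "extract" just sleeps instead of running a real query.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

MAX_CONCURRENT_SCANS = 2  # hypothetical limit, agreed with the DBA


class ScanThrottle:
    """Caps how many table extracts run against the database at once."""

    def __init__(self, limit=MAX_CONCURRENT_SCANS):
        self._slots = threading.BoundedSemaphore(limit)
        self._lock = threading.Lock()
        self._active = 0
        self.peak = 0  # highest concurrency observed, useful for checking

    def run(self, extract, table):
        with self._slots:  # blocks until one of the slots is free
            with self._lock:
                self._active += 1
                self.peak = max(self.peak, self._active)
            try:
                return extract(table)
            finally:
                with self._lock:
                    self._active -= 1


def fake_extract(table):
    """Stand-in for a real full-table read; sleeps instead of querying."""
    time.sleep(0.05)
    return f"{table}: done"


def extract_all(tables, throttle, workers=8):
    """Submit every table at once; the throttle decides how many really run."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(throttle.run, fake_extract, t) for t in tables]
        return [f.result() for f in futures]
```

Even with eight worker threads asking for data, `extract_all(tables, ScanThrottle(2))` never lets more than two extracts run at once. The right number is exactly the kind of thing the DBA and the Hadoop team should agree on together.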
What about Java?
Hadoop is written in Java, and a fairly large number of Hadoop jobs will be written in Java too.
Hadoop admins will need to be able to read Java error messages (because this is typically what you get from Hadoop), understand concepts of Java virtual machines and a bit about tuning them, and write small Java programs that can help in troubleshooting. On the other hand, most admins don’t need to write huge amounts of Hadoop code (you have developers for that), and for what they do write, non-Java solutions such as Streaming, Hive, and Pig (and Impala!) can be enough. My experience taught me that good admins learn enough Java to work on a Hadoop cluster within a few days. There’s really not that much to know.
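Reading Java error messages mostly means finding the last "Caused by:" line in a stack trace, since that is usually the real problem. Here is a rough sketch of that habit turned into a small script. The function name `root_cause` and the pattern are my own illustration, and it only handles the simple case where each exception line starts with a class name ending in `Exception` or `Error`.

```python
import re

# Matches the first line of an exception or a "Caused by:" line, for example
# "java.io.IOException: message" or "Caused by: java.net.ConnectException: ...".
_EXCEPTION_LINE = re.compile(
    r"^(?:Caused by: )?(?P<cls>[\w.$]+(?:Exception|Error))(?::\s*(?P<msg>.*))?$"
)


def root_cause(trace):
    """Return (exception class, message) for the last exception in the chain.

    Returns None when no exception line is found.
    """
    cause = None
    for line in trace.splitlines():
        match = _EXCEPTION_LINE.match(line.strip())
        if match:
            # Later matches overwrite earlier ones, so the deepest cause wins.
            cause = (match.group("cls"), match.group("msg") or "")
    return cause
```

Fed a trace whose first line is `java.io.IOException: Job failed` and whose last cause is `Caused by: java.net.ConnectException: Connection refused`, it returns the `ConnectException` and its message, which is where the troubleshooting should start. The `at ...` frame lines and the `... 3 more` lines are skipped.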
What about SAN technology?
Hadoop’s storage system is very different from SAN and generally uses local disks (JBOD), not storage arrays and not even RAID. Hadoop admins will need to learn about HDFS, Hadoop’s file system, but not about traditional SAN systems. However, if they are DBAs or sysadmins, I suspect they already know far too much about SAN storage.
So what skills do Hadoop Administrators need?
First and foremost, Hadoop admins need general operational expertise: good troubleshooting skills, an understanding of system capacity and bottlenecks, and the basics of memory, CPU, OS, storage, and networks. I will assume that any good DBA has these covered.
Second, good knowledge of Linux is required, especially for DBAs who spent their life working with Solaris, AIX, and HP-UX. Hadoop runs on Linux. They need to learn Linux security, configuration, tuning, troubleshooting, and monitoring. Familiarity with open source configuration management and deployment tools such as Puppet or Chef can help. Linux scripting (Perl / bash) is also important – they will need to build a lot of their own tools here.
Third, they need Hadoop skills. There’s no way to avoid this :) They need to be able to deploy a Hadoop cluster, add and remove nodes, figure out why a job is stuck or failing, configure and tune the cluster, find the bottlenecks, monitor critical parts of the cluster, configure NameNode high availability, pick a scheduler and configure it to meet SLAs, and sometimes even take backups.
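As a small taste of the "build your own tools" part, here is a sketch of the kind of home-grown monitor a DBA would normally drop into crontab, applied to local data disks on a cluster node. It uses only Python's standard `shutil.disk_usage`. The 80% threshold and the function names are assumptions of mine for illustration, not Hadoop settings.

```python
import shutil

WARN_PERCENT = 80.0  # hypothetical alert threshold


def percent_used(total_bytes, used_bytes):
    """Percentage of the disk in use; 0.0 for an empty or unknown total."""
    return 100.0 * used_bytes / total_bytes if total_bytes else 0.0


def check_dirs(paths, warn_percent=WARN_PERCENT, usage=shutil.disk_usage):
    """Return (path, percent used) for every directory at or over the threshold.

    `usage` can be swapped for a fake in tests; by default it is
    shutil.disk_usage, which returns (total, used, free) in bytes.
    """
    alerts = []
    for path in paths:
        total, used, _free = usage(path)
        pct = percent_used(total, used)
        if pct >= warn_percent:
            alerts.append((path, round(pct, 1)))
    return alerts
```

Calling `check_dirs` with the list of data directories on a node (whatever paths your cluster actually uses) and mailing or logging the result is enough for a first cron-driven check. A real cluster needs a lot more than this, but the habit is the same one DBAs already have.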
So yes, there’s a lot to learn. But very little of it is Java, and there is no reason DBAs can’t do it. However, with Hadoop Administrator being one of the hottest jobs in the market (judging by my LinkedIn inbox), they may not stay DBAs for long after they become Hadoop Admins…
Any DBAs out there training to become Hadoop admins? Agree that Java isn’t that important? Let me know in the comments.