First dive into Talend Open Studio

Background Story

Back in the day, I worked on an automation flow that was built with a language called BPEL (Business Process Execution Language). It is the kind of development where the GUI presents a set of components, and building a flow mostly means dragging and dropping them onto a canvas.

To be honest, this kind of development feels fresh at the start. But the more we worked with these tools, the more issues turned up, and we were forced back into the code to trace down the errors. Under the “fancy” cover of the GUI sits a code-generation mechanism that still produces a fairly complex piece of Java code.


Coming back to Talend, the topic of this post: it relates to that earlier experience because both follow the same development mode, using an interactive GUI to design and orchestrate workflows. And, perhaps not surprisingly, both generate Java code behind the scenes.

So here is what I have experienced with Talend Open Studio.

The Good
  1. It’s essentially Java! The generated output is plain Java. This means that if there is a syntax error, or a misunderstanding of how the code works, we can always open the code view next to the “Designer” panel of the canvas and find the exact reason.
  2. The orchestration flow is clear and easy to read from the start thanks to the canvas-based design, which can be much cleaner and easier to read than the code. (Opinionated!)
  3. Many ready-made components to build on instead of implementing them from scratch, e.g. FTP connections, file reading, listing and writing, AWS S3 interaction, and data flow processors such as tMap, tNormalize, tUnite, tJavaFlex and tJavaRow.
The Bad
  1. It’s essentially Java! The runtime environment is slow because of the JVM and requires an extra compilation step beforehand. Digging through Java runtime errors is not fun. The generated code gets more and more complex as more components are put on the canvas, which eventually eats up all the machine’s resources.
  2. The context is a double-edged sword. It provides a clean and neat way of passing variables between jobs. But without a plan for keeping context variables reusable and tidy (similar to the idea of “eliminating global variables” in other code), their number and the maintenance overhead can easily blow up.
  3. Some learning curve is expected with components like tJavaFlex. They may not behave as expected when first used. And the documentation about these components is just terrible all over the internet.
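
Much of the tJavaFlex confusion comes from its three code sections. Here is a minimal sketch in plain Java of how they fit together, assuming the Start code runs once before the rows, the Main code runs once per row and the End code runs once after the last row. The class JavaFlexSketch and its run method are made up for illustration; this is not the code Talend actually generates.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the code wrapped around a tJavaFlex component.
// Names are illustrative only, not Talend's generated identifiers.
class JavaFlexSketch {

    static List<String> run(List<String> rows) {
        List<String> log = new ArrayList<>();

        // "Start code": executed once, before the first row arrives
        int count = 0;
        log.add("start");

        for (String row : rows) {
            // "Main code": executed once for every incoming row
            count++;
            log.add("row " + count + ": " + row.toUpperCase());
        }

        // "End code": executed once, after the last row
        log.add("end, rows=" + count);
        return log;
    }

    public static void main(String[] args) {
        // Print the order in which the three sections run for two rows
        for (String line : run(Arrays.asList("a", "b"))) {
            System.out.println(line);
        }
    }
}
```

Variables declared in the Start section (count here) stay visible in the Main and End sections, which is why counters and accumulators usually go there.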
Learnings
  1. Get faster machines with more RAM.
  2. Plan how the context will be managed beforehand.
  3. A good way of learning Talend is simply to use it. It may take some time at the start, but it pays back later. Especially when a particular component is unfamiliar, putting in tons of “System.out.println” calls will definitely help in understanding the execution order and flow.
  4. Putting in plenty of tWarn components as placeholders and log messages helps in understanding the application, and also keeps the job itself better organized.
  5. Use tRunJob wisely, since each job represents a standalone process. This means each job can be run independently and produce a valid result based on its environment and inputs.
  6. Distinguish the concepts of flow and row. Flow mainly concerns process orchestration, while row represents the data stream. That said, there are many cases where row data has to be converted into a different flow and vice versa. Think it through; there are a lot of options here.
  7. Links between components such as “main” and “iterate” pass the data row along the flow.
  8. Links between components such as “onSubJobOK”, “onSubJobError”, “onComponentOk” and “run If” fire a trigger once the current component finishes and the condition is met.
  9. Passing values from a child job to its parent job: context and bufferOutput are commonly used. Be cautious about global variables.
  10. “CHILD_RETURN_CODE” is a useful tool to reflect the running status of a tRunJob.
  11. Useful tip: press Ctrl + Space to trigger a lookup of all global variables available at a given point.
  12. All exceptions should be handled properly; otherwise they escalate to the top until the process gets killed. Same rule as Java.
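
To make points 10 and 12 a bit more concrete, here is a rough sketch in plain Java. It assumes the parent job reads the child status from globalMap under a key like "tRunJob_1_CHILD_RETURN_CODE" (the component name tRunJob_1 is just an example), and that a failing child is recorded as a non-zero code instead of being rethrown. RunJobSketch, childJob and runChild are hypothetical stand-ins, not Talend APIs.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical simulation of a parent job checking a child job's status.
class RunJobSketch {

    // Local stand-in for the Map<String, Object> that Talend jobs call globalMap
    static final Map<String, Object> globalMap = new HashMap<>();

    // Made-up child job: an unhandled exception here escalates to the caller
    static void childJob(String input) {
        if (input == null || input.isEmpty()) {
            throw new IllegalArgumentException("missing input");
        }
    }

    // Stand-in for tRunJob_1: catch the failure and store a return code
    static void runChild(String input) {
        int code = 0;
        try {
            childJob(input);
        } catch (RuntimeException e) {
            code = 1; // assumption: any non-zero value means the child failed
        }
        globalMap.put("tRunJob_1_CHILD_RETURN_CODE", code);
    }

    // The kind of boolean expression a "run If" link could hold
    static boolean childFailed() {
        return ((Integer) globalMap.get("tRunJob_1_CHILD_RETURN_CODE")) != 0;
    }

    public static void main(String[] args) {
        runChild("data.csv");
        System.out.println("child failed? " + childFailed());
        runChild("");
        System.out.println("child failed? " + childFailed());
    }
}
```

Without the try/catch in runChild, the exception from childJob would travel up through main and end the process, which is the escalation described in point 12.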