First dive into Talend Open Studio

Background Story

Back in the day, I was involved in developing an automation flow that required a language called BPEL (Business Process Execution Language). It is the type of development where the GUI presents different components, and building a flow is a matter of dragging and dropping them onto a canvas.

To be honest, this kind of development looks fresh at the start. But more issues surfaced during further interaction with these development tools, and we were forced back into the code to trace down errors: under the "fancy" cover of the GUI tools sits an auto-generating code mechanism which still produces a fairly complex piece of Java code.


Coming back to the topic of this post, Talend relates to that earlier experience because both follow the same development mode: an interactive GUI for designing and orchestrating workflows. And perhaps not surprisingly, both generate Java code at the back.

So here is what I have experienced with Talend Open Studio.

The Good
  1. It's essentially Java! The generated code is plain Java. That means if there is a syntax error, or a misunderstanding of how the code works, we can always open the code view beside the "Designer" panel of the canvas and find out the exact reason.
  2. The orchestration flow is clear and easy to read from the start, given the canvas-based design illustration, which can be much cleaner and easier to read than the code. (Opinionated!)
  3. Many featured components to work with instead of implementing them from scratch, e.g. FTP connections; file reading, listing, and writing; AWS S3 interaction; and data-flow processors such as tMap, tNormalize, tUnite, tJavaFlex, and tJavaRow.
The Bad
  1. It's essentially Java! The runtime environment is slow due to the JVM and requires extra compilation beforehand. Digging through Java runtime errors is not fun, and the generated code grows more and more complex as more components are put on the canvas, eventually taking up all the resources.
  2. The context is a double-edged sword. It provides a clean and neat way of passing variables between jobs. But without proper management of contexts in a reusable, clean way (similar to the concept of "eliminating global variables" when writing other code), their number and the maintenance overhead can easily blow up.
  3. Some learning curve is expected when dealing with components like tJavaFlex. They may not work as expected when first used, and the documentation for these components is just terrible all over the internet.
Learnings
  1. Get a faster machine with bigger RAM.
  2. Plan context management properly beforehand.
  3. A good way of learning Talend is simply to use it. It may take some time at the start, but it always pays back at a later stage. Especially when one particular component is unfamiliar, putting in tons of "System.out.println" calls definitely helps in understanding the execution order and flow.
  4. Putting in a lot of tWarn components as placeholders and logging messages helps in understanding the application, and helps the program stay better self-organized.
  5. Use tRunJob wisely, since each job represents a standalone process. That means each job can be run independently and produce a valid result based on the environment and inputs.
  6. Distinguish the concepts of flow and row. Flow mainly focuses on process orchestration, while row represents the data stream. That said, there are many cases where we need to convert row data into a different flow and vice versa. Think carefully; there are a lot of options here.
  7. Links between components such as "Main" and "Iterate" pass the data rows along the flow.
  8. Links such as "OnSubjobOk", "OnSubjobError", "OnComponentOk", and "Run if" trigger once the preceding component finishes and the condition is met.
  9. For passing values from a child job back to the parent job, context variables and tBufferOutput are commonly used. Be cautious about global variables.
  10. "CHILD_RETURN_CODE" is a useful tool to reflect the tRunJob exit status (see the sketch after this list).
  11. Useful tip: Ctrl + Space triggers a lookup of all global variables available at a given point.
  12. All exceptions should be handled properly; otherwise they escalate to the top until the process gets killed. Same rule as in Java.
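
To illustrate items 9, 10, and 12, here is a minimal sketch of the kind of code one might put into a tJava component placed right after a tRunJob. It assumes a component named tRunJob_1 on the canvas and a context variable named inputDir, both made up for this example; globalMap and context are the objects Talend exposes in its generated Java code.

// minimal sketch for a tJava component following tRunJob_1 (name assumed)
// globalMap is the Map<String, Object> Talend injects into generated code
Integer childRc = (Integer) globalMap.get("tRunJob_1_CHILD_RETURN_CODE");

// context variables are plain fields on the generated context object;
// "inputDir" is a hypothetical context variable for this example
System.out.println("child job finished with rc=" + childRc
        + ", inputDir=" + context.inputDir);

// handle failures explicitly instead of letting them escalate to the top
if (childRc == null || childRc.intValue() != 0) {
    throw new RuntimeException("child job failed, rc=" + childRc);
}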

CORS and cross-origin HTTP requests

It seems that you may already know quite a bit about HTTP requests and CORS. But there is always something to learn when you look closer.

CORS is short for Cross-Origin Resource Sharing. As the name shows, it gives web servers cross-domain access control over either static or dynamic data; for static data sharing, this is typically how CDNs work. In essence, it adds new HTTP headers that allow servers to describe the set of origins that are permitted to read that information from a web browser. This is the basic background knowledge.

An interesting note about these requests is that some of them are "preflighted". The cause is that many HTTP methods can have side effects on the server's data. To prevent an actual data impact, the more elegant solution is mandating that the browser "preflight" the request to get "approval" from the server first via an HTTP OPTIONS request, and only then send the actual HTTP request.

There are conditions that trigger a CORS preflight: meeting any of the three conditions below requires a preflighted request. Otherwise, a simple request is sufficient to achieve the goal (see the example after the list).

  1. The request uses a method other than the following

    • GET
    • POST
    • HEAD
  2. The request includes any header other than the following (the CORS-safelisted request headers)
    • Accept
    • Accept-Language
    • Content-Language
    • Content-Type (but note the additional requirements below)
    • DPR
    • Downlink
    • Save-Data
    • Viewport-Width
    • Width
  3. The Content-Type header has a value other than the following
    • application/x-www-form-urlencoded
    • multipart/form-data
    • text/plain
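
For instance, the hypothetical request below is preflighted because its Content-Type is application/json, which falls outside the three allowed values (the endpoint URL is made up):

// the JSON body forces a preflight: application/json is not a
// CORS-safelisted Content-Type
fetch('https://api.example.com/items', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ name: 'test' })
});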

Below is how the exchange between client and server plays out for a preflighted request.

Here is the full header exchange for a preflighted request. First comes the preflight request and its response; the hostnames and paths below are a sketch of a typical exchange, not captured traffic.
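
OPTIONS /resources/post-here/ HTTP/1.1
Host: bar.example
Origin: https://foo.example
Access-Control-Request-Method: POST
Access-Control-Request-Headers: Content-Type

HTTP/1.1 204 No Content
Access-Control-Allow-Origin: https://foo.example
Access-Control-Allow-Methods: POST, GET, OPTIONS
Access-Control-Allow-Headers: Content-Type
Access-Control-Max-Age: 86400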

Second comes the actual request for the resource, again as an illustrative sketch.
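
POST /resources/post-here/ HTTP/1.1
Host: bar.example
Origin: https://foo.example
Content-Type: application/json

{ "name": "test" }

HTTP/1.1 200 OK
Access-Control-Allow-Origin: https://foo.example
Content-Type: text/plain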

Handy script to work with PDF files

PDF is a commonly used printing file format. Creating and organising PDFs can be a bit hard without Acrobat Pro's help.

Today I discovered some open-source tools to manipulate PDF files, covering creating, splitting, merging, and re-ordering PDF pages.

Below is the script line for creating a PDF:

echo 'Hello World!' | enscript -B -o - | ps2pdf - content.pdf

The -B flag makes enscript omit the page header from the output.


Here is the script for re-ordering a raw PDF into a book-style printable version (it relies on pdftk):

#!/bin/bash

if [ $# -ne 2 ] ; then
    echo "Invalid input supplied, please choose input PDF and page counts"
    echo "    example : ./format.sh <input.pdf> <pageCount>"
    exit 1
fi

echo "Formatting begins ... "
file="$1"
pageCount="$2"

let "blankPageCount = $pageCount % 4"

# Check if blank page padding needed, if so create one
if [ $blankPageCount -ne 0 ] ; then
    # a bare "showpage" makes ghostscript emit one empty page
    echo showpage | ps2pdf - blank.pdf
fi

# setup each fold loop - one fold means one print page
let "foldCount = $pageCount / 4"
startFold=0
pageOrganizor=''

echo "    total $foldCount printing slides generated";
echo "    with $blankPageCount blank pages";

while [ $startFold -lt $foldCount ] ; do
    let "coverPrintpage = $startFold * 4 + 4"
    let "leftPrintpage = $startFold * 4 + 1"
    let "backPrintpage = $startFold * 4 + 3"
    pageOrganizor="$pageOrganizor A$coverPrintpage A$leftPrintpage-$backPrintpage"
    let "startFold += 1"
done

if [ $blankPageCount -eq 0 ] ; then
    cmd="pdftk A=$file cat$pageOrganizor output fmt_output.pdf"
elif [ $blankPageCount -eq 1 ] ; then
    let "blankStartPage = $pageCount-2"
    cmd="pdftk A=$file B=blank.pdf cat$pageOrganizor B A$blankStartPage-$pageCount output fmt_output.pdf"
elif [ $blankPageCount -eq 2 ] ; then
    let "blankStartPage = $pageCount-1"
    cmd="pdftk A=$file B=blank.pdf cat$pageOrganizor B A$blankStartPage A$pageCount B output fmt_output.pdf"
else
    cmd="pdftk A=$file B=blank.pdf cat$pageOrganizor B A$pageCount B B output fmt_output.pdf"
fi

echo "Merging pdfs ... "
$cmd

# cleaning up
echo "Cleanup and done"
if [ -e "blank.pdf" ] ; then
    rm blank.pdf
fi
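
As a usage sketch: for a 10-page input, 10 % 4 leaves 2 pages over, so two blank pages are padded in, and the script ends up running a pdftk command like the following (input.pdf stands for whatever file is passed in).

./format.sh input.pdf 10
# -> pdftk A=input.pdf B=blank.pdf cat A4 A1-3 A8 A5-7 B A9 A10 B output fmt_output.pdf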

Outline about RESTful API system

Recently I have been doing a lot of RESTful API programming. My understanding of REST has evolved from "HTTP-only" requests into something more. Below is what I have gathered from recent thoughts.

  1. REST is Representational State Transfer

    It is, and only is, a design framework for making network communication "web-like." No specific protocol is proposed, and its main focus is on the communication part. Several questions should be considered during development:

    • What are the components of the system?
    • How should they communicate with each other?
    • How do we ensure we can swap out different parts of the system at any time?
    • How can the system be scaled up to serve billions of users?
  2. Client-server

    This is what the basic system flow of any REST system looks like. It has to be a 1-to-1 flow with a centralized resource server, and any other party who wants to interact with the resources does so as a client of that server.

  3. Stateless

    This is the most important principle in REST: "each request is treated as standalone." In detail, when a client is not interacting with the server, the server has no idea of its existence. The server also does not keep a record of past requests.

  4. Stable identification of resources

    Each resource must be uniquely identified by a stable identifier. A “stable” identifier means that it does not change across interactions, and it does not change even when the state of the resource changes.

  5. Manipulation of resources through representations

    The client manipulates resources by sending representations to the server, usually a JSON object containing the content that it would like to add, delete, or modify. On the web, it sends an HTTP POST or PUT request with the content to the server, and the server sends back a response indicating whether the operation was successful.

  6. Self-descriptive messages

    A self-descriptive message is one that contains all the information that the recipient needs to understand it. On the web, for example, consider a request like the one below:

    GET / HTTP/1.1
    Host: www.example.com
    

    This message is self-descriptive because it tells the server which HTTP method was used and which protocol version (HTTP 1.1).

    The server may send back a response like this:

    HTTP/1.1 200 OK
    Content-Type: text/html
    
    <!DOCTYPE html>
    <html>
      <head>
        <title>Home Page</title>
      </head>
      <body>
        <div>Hello World!</div>
      </body>
    </html>
    

    This message is self-descriptive because it tells the client how to interpret the message body, by indicating that the Content-Type is text/html.

  7. Hypermedia

    Hypermedia is a fancy word for data sent from the server to the client that contains information about what the client can do next, in other words, what further requests it can make. (Optional; see the sketch after this list.)

  8. Cache refers to the constraint that server responses should be labelled as either cacheable or non-cacheable (e.g. via a Cache-Control: max-age=3600 or Cache-Control: no-store header). This label belongs in the "self-descriptive message" mentioned above.
  9. Layered system refers to the fact that there can be more components than just servers and clients, for example a proxy acting as a load balancer, a security checker, an authenticator, or a gateway.
  10. Code on demand is the ability of a server to send executable code to the client; for example, an HTTP server sends a <script> tag for the client to execute locally.
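
For illustration, the hypermedia mentioned in item 7 might look like the sketch below; the resource, fields, and link relations are made up, and real formats vary by convention (e.g. HAL).

HTTP/1.1 200 OK
Content-Type: application/json

{
  "id": 17,
  "status": "active",
  "links": [
    { "rel": "self",   "href": "/orders/17" },
    { "rel": "cancel", "href": "/orders/17/cancel" }
  ]
}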

Enable Xdebug on PHP development

Xdebug is crucially important for PHP stack trace debugging.

Here is what I have done to get started.

  1. Install Xdebug on Ubuntu,
    sudo apt-get install php5-xdebug
    
  2. Modify php.ini to activate Xdebug
    ; file: /etc/php5/fpm/php.ini
    [xdebug]
    zend_extension="/usr/lib/php5/20121212/xdebug.so"
    xdebug.remote_enable=1
    xdebug.remote_handler=dbgp
    xdebug.remote_mode=req
    xdebug.remote_host=127.0.0.1
    xdebug.remote_port=9000
    
  3. Restart the webserver
    sudo service nginx restart
    

On a separate note, and as an important feature, we can now call the following functions:

    // dump the stack trace up to the calling point
    var_dump(xdebug_get_function_stack());

    // print the stack trace up to the calling point, with a custom message
    xdebug_print_function_stack( 'Your own message' );

These functions trace back all the calls up to the calling point.
The first one also displays every parameter passed along the way, which is extremely useful for debugging a large codebase.


Or follow the instructions on the Xdebug site. (Cited from Xdebug)

Instructions

Download xdebug-2.3.3.tgz
Unpack the downloaded file with tar -xvzf xdebug-2.3.3.tgz
Run: cd xdebug-2.3.3
Run: phpize (see the FAQ if you don't have phpize).

As part of its output it should show:

Configuring for:

Zend Module Api No: 20121212
Zend Extension Api No: 220121212
If it does not, you are using the wrong phpize. Please follow this FAQ entry and skip the next step.

Run: ./configure
Run: make
Run: cp modules/xdebug.so /usr/lib/php5/20121212
Update /etc/php5/fpm/php.ini and change the line
zend_extension = /usr/lib/php5/20121212/xdebug.so
Restart the webserver

Play with Node – Express, Jade and Mongoose

Recently I tried Node by developing a handy geocoding logger. View demo.
The object of this page is to utilize Google's autocomplete tool and Geocoding API to log the geographic coordinates of several locations of interest.

On the page,

  1. The user can type in the input box to find a location
  2. Once a location is chosen, it is pinned on the map while its geographic coordinates, along with its address attributes, are logged into our DB (Mongo) using Ajax
  3. The page should change dynamically once a new record is added
  4. The user can delete a location from the record list
  5. The page should also change dynamically once a record is deleted
  6. No location should be recorded twice

If interested, here is the Source Code

For the design, I wanted it to be a one-page application. I wanted to try NodeJs for the cool syntax, and to utilize MongoDB since the address components can be extremely dynamic. Here is what I have done.

  1. Due to a vulnerability of expressJs in Node, I chose PM2 to manage the application
  2. Used nginx as a reverse proxy to route requests to Node
  3. Simple route coding using express, handling the GET request on page load and the POST requests from Ajax calls
    Express body-parser handles JSON-related requests.
  4. Used Jade + Bootstrap to mock up the page; it is really handy to do all the page mock-up using Jade's simplified syntax.
  5. Loaded the Google API to render the map and the autocomplete plugin
    Since v3, the Google JavaScript API no longer requires an API key for the frontend js calls
  6. Fired a custom event to export the location geometry parameters, and captured it to post the Ajax request
    From here on, comparing the backend js file and the frontend (client-side) js file, I really felt that Node encourages strongly MVC-structured code. Developing the backend requires a pm2 restart of the app, while developing the frontend is change-and-go. This is very much like Java development.
  7. Data-related: chose Mongoose to map objects from the DB. In the schema definition (see the sketch after this list),
    • use non-strict {strict: false} to dynamically add unexpected attributes inside the model
    • apply a unique index on the combination of latitude & longitude, index({ latitude: 1, longitude: 1 }, { unique: true }), to prevent duplication
    • add a pre (save) hook to change the update_at field whenever a record changes (somehow, findByIdAndUpdate does not trigger this pre hook)
    • update status => "disable" instead of removing, to keep track of every bit of data input
  8. On-page HTML DOM changes according to user actions (adding & deleting): not much worth mentioning.
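
As referenced in item 7, here is a minimal sketch of such a schema. The model name Location and the default status value are assumptions for illustration; update_at and {strict: false} follow the notes above.

var mongoose = require('mongoose');

// non-strict, so unexpected address attributes can be stored dynamically
var locationSchema = new mongoose.Schema({
    latitude:  Number,
    longitude: Number,
    status:    { type: String, default: 'enable' },  // hypothetical default
    update_at: Date
}, { strict: false });

// unique compound index on latitude + longitude to prevent duplicates
locationSchema.index({ latitude: 1, longitude: 1 }, { unique: true });

// pre-save hook keeping update_at fresh; document middleware like this
// is not triggered by findByIdAndUpdate, as noted above
locationSchema.pre('save', function (next) {
    this.update_at = new Date();
    next();
});

module.exports = mongoose.model('Location', locationSchema);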

Database Normalization

Database normalization is the common term for describing a good schema for DB tables. Design tables while keeping normalization in mind.

1NF: each row has a unique identifier, and no column contains a repeating group of data.

2NF: in addition, no column has a partial dependency on the primary key, i.e. depends on only part of a composite key (for example, in an order_items table keyed by (order_id, product_id), a product_name column depends only on product_id and violates 2NF).

3NF: in addition, every non-prime attribute must depend on the primary key directly, rather than transitively through another non-key column.

Normalization aims to reduce data redundancy, usually by breaking tables down.

There are also BCNF, 4NF, 5NF. (More to come).

Decorator v.s. Inheritance

Comparing decorator & inheritance:

Decorator:

  • Logic isolated per object
  • Dynamically configured at runtime
  • Multiple decorators can be stacked to combine functionality at runtime
  • Involves creating many small objects, complicating initialization & further debugging

Inheritance:

  • Class-dependent
  • Statically defined functions & attributes
  • Different object initializations come from multiple pre-defined inherited classes
  • Comparatively simple to initialize objects and to isolate bugs related to a certain class

A great code example, cited from StackOverflow.

Decorator style:

Stream sPlain = Stream();
Stream sEncrypted = EncryptedStream(Stream());
Stream sZipped = ZippedStream(Stream());
Stream sZippedEncrypted = ZippedStream(EncryptedStream(Stream()));
Stream sEncryptedZipped = EncryptedStream(ZippedStream(Stream()));

and Inheritance style:

class Stream {...}
class EncryptedStream : Stream {...}
class ZippedStream : Stream {...}
class ZippedEncryptedStream : EncryptedStream {...}
class EncryptedZippedStream : ZippedStream {...}
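
Incidentally, Java's own I/O library is a real-world instance of the decorator style above, with stream wrappers stacked at runtime. Below is a minimal runnable sketch; the file name is arbitrary.

import java.io.*;
import java.util.zip.GZIPOutputStream;

public class DecoratorDemo {
    public static void main(String[] args) throws IOException {
        // decorators stacked at runtime:
        // buffering wraps gzip, which wraps the raw file stream
        try (OutputStream out = new BufferedOutputStream(
                new GZIPOutputStream(new FileOutputStream("data.gz")))) {
            out.write("Hello World!".getBytes());
        }
    }
}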