Friday, 21 February 2014

Retail Fashion - Fashionista For Android

This is free software: Retail Fashion at your fingertips. Check it out on your Android device.

Do you follow Retail Fashion brands like Emporio Armani and Marks and Spencer?  Are you always on the lookout for the latest from fashion retailers?

'Retail Fashion' brings fashion products and prices to your Android phone or tablet, at your fingertips.

Have you gone to a fashion store and forgotten the T-Shirt brand and design you saw online? Wish you could show a photo or the product details? Retail Fashion on Android will do just that.

Next time you are winding up your day and feel too tired to start your laptop to browse fashion retail, remember you have Retail Fashion on your Android mobile or tablet.

A) Features
-----------------
1) Find and save favorite lines/categories. (Check out the screenshots.)
2) Sift through saved categories.
3) Save retail items to buy in store.
4) Your data connection/Wi-Fi is used only when needed.
5) Product prices are also listed.

B) Info for users
------------------------
a) Current support for Emporio Armani US and Marks and Spencer UK. More retailers will follow.
b) Wish to add a retailer or a feature? Contact the developer!
c) "Retail Fashion on Android" is the store listing name. App name on your phone is Fashionista.

C) Information for retailers
-----------------------------------
a) "Retail Fashion" cloud servers do not store images from Retailers.

b) Do you want to reach out to millions of users with this mobile software, branded exclusively for you? Get in touch with the developer.

D) Software license agreement
---------------------------------------
https://docs.google.com/document/d/1doaUJOmL1GTpq-geniMxhr-o5MUDriyABly-BiOiZOQ/edit?usp=sharing

Friday, 10 January 2014

Task Queues in Python: Getting work done on Google AppEngine Cloud

The previous post mentioned task queues in AppEngine's application model. This post is an example of push queues for Google AppEngine in Python 2.7. Task queues bring the concept of parallel work to an application, and they encourage doing work in small chunks rather than all at once. Task queues are useful for deferred work: if a task does not need to run immediately, or its result need not be shown to the user right away, queue it for deferred execution.

AppEngine tasks are put in queues. In a push queue (the default), AppEngine takes tasks off the queue and processes them. In a pull queue, the application writes the code to pull tasks itself, and it must also take finished tasks off the queue.

Once tasks are on a queue, they need to be executed. Task execution is realised by writing URL handlers: for every queue there is a URL and a handler, and the URL for a queue is of the form /_ah/queue/<queue name>. When AppEngine takes a task off a push queue for execution, it POSTs to the queue's URL. The application needs to handle this POST request; the POST handler is where the application does whatever the task requires.

Task queues also help when an AppEngine instance is terminated in the middle of a task. Each task has a name, and AppEngine remembers the name for some time, so duplicate tasks cannot be queued. There are mechanisms to adjust the frequency/speed of queue execution: a queue executes a task only if it has a token. Tokens are held in a bucket and are replenished; apps can specify both the replenishment rate and the bucket size. If five tasks are on the queue and there are four tokens, four of those tasks are executed and the remainder wait until replenishment.

Tasks also provide a built-in mechanism to recover from failure: failed tasks are retried automatically, and the retry mechanism can be controlled. Applications can specify task age, a retry limit and a back-off policy. Finally, task queues have a target, which can point to a backend instance, in which case that backend executes the queue's tasks.
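The token-and-bucket rate mechanism described above can be sketched in plain Python. This is not App Engine's implementation, just a toy model of the idea (class and variable names are this sketch's own):

```python
class TokenBucketQueue:
    """Toy model of App Engine's rate mechanism: a queue may only
    execute a task when a token is available in its bucket."""

    def __init__(self, rate, bucket_size):
        self.rate = rate                # tokens added per replenish tick
        self.bucket_size = bucket_size  # bucket never holds more than this
        self.tokens = bucket_size
        self.tasks = []

    def add(self, task):
        self.tasks.append(task)

    def replenish(self):
        self.tokens = min(self.bucket_size, self.tokens + self.rate)

    def run(self):
        """Execute queued tasks while tokens last; the rest wait."""
        done = []
        while self.tasks and self.tokens > 0:
            self.tokens -= 1
            done.append(self.tasks.pop(0)())
        return done

q = TokenBucketQueue(rate=2, bucket_size=4)
for i in range(5):
    q.add(lambda i=i: "task-%d" % i)
ran = q.run()          # 4 tokens -> only 4 of the 5 tasks run
q.replenish()          # tokens are replenished
ran += q.run()         # the remaining task now runs
```

With a bucket of four tokens and five queued tasks, four run immediately and the fifth waits for replenishment, mirroring the five-tasks/four-tokens example above.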

An example of GAE push task queues:
Python sample code for task queues is available here. In this example, a master queue holds a master task: an aggregate task with subtasks. A subtask in this example simply counts from A to B and returns a list; the master task specifies subtasks for a number of intervals A and B. When all subtasks are done, the master task is done too. The master task is pushed onto its queue; from there it is taken off and does its job of creating subtasks, which it pushes onto a separate queue for execution. In other words, we tie the two task queues together. Subtasks get executed as and when their turn comes. A unique master task (by name, refer to the code) cannot be duplicated; the same is true for subtasks. In this example the subtasks simply write their counts to the backend's log, which is also shown below. The backend also has a shutdown hook (refer to the sample code). Parameters can be sent to tasks as key-value pairs and as payload.
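Stripped of the queue machinery, the master/subtask division of labour can be sketched as below. The function names are illustrative, not taken from the sample code; in the real app the subtasks would be pushed onto the second queue rather than called directly:

```python
def subtask(a, b):
    """A subtask simply counts from a to b and returns the list."""
    return list(range(a, b + 1))

def master_task(intervals):
    """The master task's job: spawn one subtask per (A, B) interval.
    Here we just call the subtasks so the aggregation logic is visible."""
    results = [subtask(a, b) for (a, b) in intervals]
    # The master task is done once all its subtasks are done.
    return results

counts = master_task([(1, 3), (4, 6)])
# counts -> [[1, 2, 3], [4, 5, 6]]
```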

a) Backend definition in the yaml file. This backend is the target for the queues in this code.

b) Queue definition in yaml.

c) App yaml file. Note the URL handlers for the queues.
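In case the screenshots do not render, a rough sketch of the queue and handler definitions follows. The queue names and script names here are made up for illustration and are not from the sample code:

```yaml
# queue.yaml -- two push queues, rate-limited via the token bucket
queue:
- name: masterqueue
  rate: 5/s
  bucket_size: 5
- name: subtaskqueue
  rate: 10/s
  bucket_size: 10

# app.yaml fragment -- one URL handler per queue
handlers:
- url: /_ah/queue/masterqueue
  script: master_handler.app
- url: /_ah/queue/subtaskqueue
  script: subtask_handler.app
```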

d) Code to enqueue an item onto a task queue, with checks for tombstoned and duplicate tasks. Task params are sent as key-value pairs; here the value is a pickled task object.
e) Master queue status after the run
f) Subtasks in a separate queue
g) Log showing the execution
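Item d passes a pickled task object as a parameter value. The round trip, independent of App Engine, looks roughly like this (the CountTask class is illustrative; the sample code's own class differs). Base64 encoding keeps the pickled bytes safe to ship as a form-field-like task parameter:

```python
import base64
import pickle

class CountTask(object):
    """Illustrative task object standing in for the sample code's own."""
    def __init__(self, a, b):
        self.a, self.b = a, b

# Sender side: pickle the task object and encode it as a param value.
task = CountTask(5, 9)
param_value = base64.b64encode(pickle.dumps(task))

# Receiver side (the queue's URL handler): decode and unpickle.
restored = pickle.loads(base64.b64decode(param_value))
```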

Code download for this app is available here.

Monday, 16 December 2013

Cloud Computing Application Development Paradigm: App Engine From Google

App Engine is an execution environment as a service. Behind the scenes it is a superset of all the other 'as a service' models, namely infrastructure as a service, software as a service and platform as a service. Developers just build and deploy, while the software, infrastructure, platform and the like are addressed implicitly.

In a traditional software deployment scenario, the production servers are prepped with application server software, databases, user accounts, recovery/backup in the form of standby mirrors and so on. Here, app deployment needs none of that. Users still get to provision websites and web apps which scale with demand. Unlike Amazon's cloud, though, developers do not get hold of a virtual machine; without a hardware/virtual instance, there is nothing to install separately for production. Memcache and load balancing are built in to App Engine. The database is replaced by the high replication data store. A web server is out of the question because handling URLs is what App Engine does best; everything is built around handling URLs (web-server-like). Cron jobs, task handlers and backend instances are all triggered and handled via URL handlers. All in all, this brings an application development paradigm targeting an execution environment in the cloud.

Since apps do not get a virtual machine in the cloud on which they can install anything they wish, applications need to stick to the ground rules and use the tools and community guidelines to be fast, reliable and scalable; otherwise they will not be able to cut it. Serving web pages, performing offline tasks like scheduled reports, and upgrading data store models all require that apps stick to the rules. For example, if an app breaches its memory limit it is terminated. Web apps run on process instances rather than machine instances; these process instances are one of the core enablers for scaling with requests. Instances come in different types with different capacities for memory, CPU, web request/response size, memcache size and offline data storage including logs. Instances can go from 128MB at 600MHz to 1024MB at 2.4GHz, and choosing among these options directly impacts your app's performance. Billing depends on, where applicable, the services in which you overrun your quota above what you already paid for.

On a lighter note, the billing mechanism can be summed up like this: you rent a Ferrari, but then you pay an extra 10 cents for every turn of each tyre if you go over 40 miles per hour, 20 cents every time you reverse, and 15 cents for every oscillation of the windscreen wiper. Every now and then there is a chance that your Ferrari will be terminated and restarted remotely. The point is, billing and the execution environment can be frustrating at first (possibly eternally) if not understood correctly.

The data store will experience contention if an app throws quite a lot of writes at it sequentially, and contention leads to timeouts. App Engine retries data store ops as such, but it is better to have a retry in the app logic too, with a back-off mechanism; this is officially advised. Data models are really flexible, but queries are learned during development and App Engine readies the indices beforehand.
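The retry-with-back-off pattern mentioned above can be sketched in plain Python. This is generic app logic, not an App Engine API; the flaky operation and delay values are invented for the demo (delays are shortened so the sketch runs quickly):

```python
import time

def with_retries(op, attempts=5, base_delay=0.01):
    """Retry an operation with exponential back-off between attempts."""
    for attempt in range(attempts):
        try:
            return op()
        except Exception:
            if attempt == attempts - 1:
                raise                      # out of attempts: give up
            time.sleep(base_delay * (2 ** attempt))  # back off, then retry

# A flaky operation standing in for a datastore write under contention:
# it times out twice, then succeeds.
calls = {"n": 0}
def flaky_put():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("timeout: datastore contention")
    return "written"

result = with_retries(flaky_put)
```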

Task recovery mechanisms for failure take top priority, if they were not there already. Servers in warehouses have a higher failure probability than servers in-house, mainly due to the sheer numbers involved. If you have 10 high-end machines in-house, the chance of one failing within a year could be remote; but in a cloud data center with 10,000+ servers, the probability of servers going down or needing replacement/maintenance just goes up. One startup noticed over the years that its Amazon instances' life span was 200 days on average! What about App Engine, then? We are dealing with process instances, so disruptions are pretty much normal and immediate; instances, especially backends, will be terminated without much warning. Either way, having your processes running on machines/instances somewhere far away means there will be disruptions. Apps can register shutdown event handlers to do something before going down, though there is no guarantee these will be triggered.

Again, if an app is running a huge task, say collecting information from the web to process offline and deliver to the client later, it needs to address disruptions: it must be able to recover from the point where the huge task failed due to a cloud infrastructure problem. This makes you do two things in particular on App Engine: break up tasks for independent execution, and handle interruptions using the platform's mechanisms. App Engine allows this with tasks and task queues. The idea is that apps must break a huge task into manageable units which run independently; otherwise they risk continual disruptions from quota overruns such as memory limits, request timeouts or data store contention, one or the other.

One robust solution is task chains: a task does a small amount of work and, before finishing, queues the next task to be executed. Failed tasks in a queue will be retried. App Engine prevents triggering the same task over and over again by tombstoning task names: simply put, it remembers a task name, say for a day, and does not allow another task to be queued with the same name. So imagine taskA triggered taskB and then failed for some reason without spawning the rest of the tasks, taskC and taskD. When taskA comes back online it will not be able to add taskB again; taskB will be executed from the queue as and when its turn comes. There are a host of retry mechanisms, like token buckets and back-offs.
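The tombstoned-name behaviour in the taskA/taskB scenario can be modelled in a few lines of plain Python. This is a toy, not the App Engine API (App Engine keeps names for a limited time, roughly a day; this sketch keeps them forever):

```python
class TombstoningQueue:
    """Toy queue that remembers task names, so a recovering task
    cannot enqueue a duplicate of work it already queued."""

    def __init__(self):
        self.pending = []
        self.tombstones = set()

    def add(self, name, work):
        if name in self.tombstones:
            raise ValueError("task name already used: %s" % name)
        self.tombstones.add(name)
        self.pending.append((name, work))

q = TombstoningQueue()
q.add("taskB", lambda: "work B")       # taskA queues taskB, then crashes...
try:
    q.add("taskB", lambda: "work B")   # ...and on recovery tries again
    duplicated = True
except ValueError:
    duplicated = False                 # rejected: taskB is tombstoned
```

The second add is rejected, so taskB sits on the queue exactly once and runs when its turn comes.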

Tuesday, 26 November 2013

Google AppEngine BackEnds in Python

AppEngine backends allow us to perform long-running, memory- and processor-intensive work. As the name suggests, these are for back-end type workloads. Say, for example, you are crawling and scraping websites to collect price points from multiple sources to serve later, or just producing a report; these types of tasks can be done on the back end. AppEngine backends are separate instances from the front-end instances. In terms of quotas, this means a lot: instead of clubbing all your work onto the front-end instances, you can do the work on the back end and save hours on the front end.

An important thing to bear in mind is that AppEngine backends share URLs and code with your main app. Backends are part of your app but run as separate instances. They are accessible via their name plus the main front-end app, as shown in the screenshot below. All the URLs that your front-end app exposes are available to the backends too. For example, if your app name is backendtest2013, you access the app as backendtest2013.appspot.com. And if you have a backend named backendone, you trigger/invoke the URLs on the backend as backendone.backendtest2013.appspot.com/<same URL here>

There are two types of backends: dynamic and resident. Dynamic backends are started on the first request and, after some idle time, are taken down. Resident backends are always on.

Here is an example of a Python backend for AppEngine. A zip of the source code is available here. The folder structure for this example is shown below; the project folder is backend_trial.

The app handles two URLs, "/" and "/scripttest". The handlers are defined in app.yaml as shown here.
There is a controller-servlet-like handler defined in backendtrial.py. The handler for the main (default) app page just shows some settings. The handler for "/scripttest" inside backendtrial.py is shown below.
AppEngine backends for Python are configured in backends.yaml. The file is shown below.
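In case the screenshot does not render, a backends.yaml along these lines would match the one-instance, class-B1, dynamic, public setup described next (a sketch; verify against the zip):

```yaml
# backends.yaml -- one dynamic, publicly accessible instance of class B1
backends:
- name: backendone
  class: B1
  instances: 1
  options: dynamic, public
```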

We declare one instance of class B1; it is dynamic and publicly accessible from outside the main app. The class puts limits on your CPU and memory power: B1 is the smallest with 128MB and 600MHz, while class B8 has 1024MB and 4.2GHz.

The code for handling the URL "/scripttest" is shown below. We will access this from both the app instance and the backend instance. The code just prints the backend name from backends.yaml if it is executed on a backend.
Upload the app using the command:
$ appcfg.py update backend_trial

Upload the backend info using the command:
$ appcfg.py backends backend_trial update

Test the main app and the backend by accessing the app using these URLs:
1) Main app
2) Backend call
Source for the project is available here.

Thursday, 24 October 2013

A coder post : Link inversion traversal to scan tree in constant space

This is a coder's post and discusses a binary tree traversal technique.

A binary tree is a popular data structure, and there are a couple of popular tree traversal techniques: one based on recursion, which includes pre-order, in-order and post-order traversals, and a second that uses an auxiliary data structure to hold the nodes still to be visited. A pre-order traversal visits a node first, then its left child, then its right child. For the example tree below, a pre-order traversal visits NY, then Chicago, then Seattle. In post-order we would see the nodes/places in this order: Chicago, Seattle, NY.



A recursive pre-order traversal is shown below, i.e. we visit a node first, then its left and right children in that order.
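If the original snippet does not render, the recursive pre-order idea looks like this in Python (a sketch using the post's example tree; the Node class is this sketch's own):

```python
class Node:
    def __init__(self, name, left=None, right=None):
        self.name, self.left, self.right = name, left, right

def preorder(node, visit):
    """Visit a node, then recurse into its left and right subtrees."""
    if node is None:
        return
    visit(node.name)
    preorder(node.left, visit)
    preorder(node.right, visit)

tree = Node("NY", Node("Chicago"), Node("Seattle"))
out = []
preorder(tree, out.append)   # out -> ['NY', 'Chicago', 'Seattle']
```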
The second technique uses a stack as an auxiliary data structure to hold the next nodes to visit. This algorithm pops a node off the stack, visits it, and pushes its children onto the stack, right child first. To start off, push the root of the tree onto the stack. For the example tree, we push NY onto the stack. Then, until the stack is empty, we pop a node (NY here) off the stack, visit NY, push Seattle and Chicago onto the stack, and repeat. You get the idea.
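The stack-based version described above can be sketched as follows (same illustrative Node class as before; pushing the right child first means the left child is popped, and therefore visited, first):

```python
class Node:
    def __init__(self, name, left=None, right=None):
        self.name, self.left, self.right = name, left, right

def preorder_stack(root):
    """Iterative pre-order: pop a node, visit it, push right then left."""
    out, stack = [], [root]
    while stack:
        node = stack.pop()
        if node is None:
            continue
        out.append(node.name)
        stack.append(node.right)  # right child first...
        stack.append(node.left)   # ...so the left child is popped first
    return out

order = preorder_stack(Node("NY", Node("Chicago"), Node("Seattle")))
```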

Recursion has memory and performance overhead from the recursive function calls, and the auxiliary data structure also has memory overhead. Link inversion traversal goes through the tree as if the links formed a solid wall: imagine walking through your home with one hand always touching the wall; you would follow the walls one after the other. Similarly, link inversion starts from the root and walks the links as if they were walls. The example shown below starts at New York and ends at New York; the exclamation marks show the visits. Each node is visited three times, once each for the pre-, in- and post-order concepts (refer to the code below).



Link inversion traversal uses the tree's own pointers to store backtracking information: while going down the tree, links are reversed to aid backtracking instead of using an additional data structure, and they are restored on the way back up, leaving the tree as it was. Link inversion needs a couple of extra node references, plus a bit field on each node to record the direction taken after first looking at the node. This is why link inversion is generally referred to as scanning the tree in constant space. A link inversion code snippet is shown below; the full code for the post is available at the end.
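A Python sketch of the technique follows (the post's own code is in the download; this version is written for this summary, using a one-bit tag per node and two extra references, and it restores the tree exactly as described above):

```python
class Node:
    def __init__(self, name, left=None, right=None):
        self.name, self.left, self.right = name, left, right
        self.tag = 0  # one bit: which way we went from this node

def link_inversion(root, visit):
    """Traverse with no stack and no recursion: reverse child links on
    the way down, restore them on the way back up.  Each node is seen
    three times (pre, in, post)."""
    parent, cur = None, root
    while True:
        while cur is not None:                 # descend, reversing left links
            visit("pre", cur.name)
            cur.tag = 0
            cur.left, parent, cur = parent, cur, cur.left
        while parent is not None and parent.tag == 1:
            visit("post", parent.name)         # back from the right subtree:
            parent.right, cur, parent = cur, parent, parent.right
        if parent is None:
            return                             # climbed past the root: done
        visit("in", parent.name)               # back from the left subtree:
        parent.tag = 1                         # note we now go right
        nxt = parent.right
        parent.right, parent.left = parent.left, cur  # swing back pointer
        cur = nxt                              # descend into the right subtree

tree = Node("NY", Node("Chicago"), Node("Seattle"))
visits = []
link_inversion(tree, lambda kind, name: visits.append((kind, name)))
```

Running it on the example tree yields the three classic orders interleaved, and afterwards the tree's links are back exactly as they were.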


Profiling performance: in this test, link inversion traversal was faster than the recursive pre-order traversal.
Input: a balanced binary search tree of 300+ city names. The link inversion code shows up as faster in the NetBeans profiler; notice the recursive time and the link inversion time in the screenshot below.


Source code for Link inversion traversal is available at the link below:
https://drive.google.com/folderview?id=0BxhHg0qy5gi4dTlWUC1KU0RUUUU&usp=sharing

References:
1) Data structures and their algorithms by Harry R. Lewis & Larry Denenberg.



Thursday, 12 September 2013

Big Data Analytics: Using Python UDFs for better managing code and development

Apache Pig is a data-flow scripting language: we specify how we want the data grouped, filtered and so on, and Pig generates the Hadoop MapReduce jobs corresponding to the Pig statements. But Pig does not have the functionality of a general-purpose programming language. User-defined functions (UDFs) help by allowing us to extend Pig: we can specify our own data manipulation alongside Pig's operators. For example, if we are filtering data, we can write a UDF to do it our way and then ask Pig to use it with FILTER ... BY. In a previous post we saw how to write user-defined functions in Java. As mentioned in that post, since Hadoop and Pig are already in Java it makes sense to write a UDF in Java; it was also mentioned that writing a UDF in Python could mean less effort and better code management.

For Java we create a jar file containing the UDF code and register that jar with Pig. As the number of UDFs grows and changes become frequent, even at the development stage, this adds to the effort: change the code, test it, jar it, then see how it works with Pig. This can end up taking a lot of time (refer to the previous post).

To save time on development iterations, we can embed Apache Pig in Python scripts; we have seen this too. UDFs can also be written in Python, which makes things a lot easier: the Pig script is embedded in a Python file, and the UDFs can live in the same file or as separate modules in the same project. The net effect is that you avoid the overhead of heading over to another project to change and jar your UDF. There is also the matter of specifying the schema of your UDF's output: this is extra code in Java, but in Python it boils down to an annotation. Basically, less code to write and maintain; deployment also becomes easier. Python UDFs and the simplicity involved are demonstrated below.

Example: we have a dump of data from seismic sensors. We need to find all the locations where there has been an earthquake of magnitude > 5, and we want the number of such quakes over the data. We filter the data with one UDF and use another UDF to align the data properly. Python UDFs are used for the sake of the demo. The full code for the project is available here.

Screen shot of the project in eclipse. Notice that everything is under one project.

Screen shot of python UDF file
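A filter UDF for the magnitude > 5 condition might look roughly like this. The function and field names are illustrative, not the project's actual ones; under Pig, the outputSchema annotation comes from the pig_util module that Pig provides to Jython scripts, so a stand-in decorator is defined here only to let the sketch run on its own:

```python
# Under Pig, this import works; the fallback lets the sketch run standalone.
try:
    from pig_util import outputSchema
except ImportError:
    def outputSchema(schema):
        def decorator(func):
            func.output_schema = schema   # record the declared schema
            return func
        return decorator

@outputSchema('strong:int')
def is_strong(magnitude):
    """Filter UDF: 1 for rows whose quake magnitude exceeds 5, else 0."""
    if magnitude is None:
        return 0
    return 1 if float(magnitude) > 5 else 0

# In the Pig script this would be used along these lines:
#   REGISTER 'udfs.py' USING jython AS udfs;
#   strong = FILTER quakes BY udfs.is_strong(mag) == 1;
```

Note how the schema of the UDF's output is a one-line annotation rather than the extra Java code mentioned earlier.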

Click for the code used in this post

References:
1. Programming Pig by Alan Gates.
2. Embedding Pig in Scripting Languages by Julien Le Dem.