Saturday, December 15, 2012

Running Teradata Aster Express on a MacBook Pro

Those who know me know that I am a complete Apple geek. Teradata supports BYOD, so naturally I have a MacBook Pro. Why wouldn't I want to run Aster Express on it? :-)

The configuration was surprisingly easy once I worked out how!

Prerequisites

  • 4 GB of memory (I have 8 GB in mine)
  • At least 20 GB of free disk space
  • OS: I am running Mac OS X 10.8.2
  • VMware: the Getting Started guide assumes VMware Player; on the Mac, buy VMware Fusion instead. I have version 5.0.2 (900491). Make sure to order the Professional version, as only that version has the new Network Editor feature.
  • 7-Zip: to extract (uncompress) the Aster Express package.

You get to the Network Editor from Preferences. Create a new network adapter, vmnet2, as shown in the screenshot below:

[Screenshot: the Network Editor with the new vmnet2 adapter]
Then make sure that for both the Queen and Worker VMware images you assign vmnet2 as the network adapter, as illustrated in the screenshot below:

[Screenshot: VM settings with the vmnet2 network adapter selected]
Those are really the only changes you need to make. Follow the rest of the instructions, as outlined in the Getting Started with Aster Express 5.0 guide, to get your Aster nCluster up and running.

If you have 8 GB of memory you might decide to allocate 2 GB to each VM instead of the default 1 GB; again, you can set this in the settings for each VMware image. I also run the Memory Clean utility (available for free from the App Store). You would be amazed what memory hogs Firefox and Safari can be. I normally shut down most other running programs when I am working with Aster Express, to give myself the best user experience.

To run the Aster Management Console, point your favourite browser at https://192.168.100.100. You may ignore any website security certificate warnings and continue to the website.


You will also find Mac versions of act and ncluster_loader in /home/beehive/clients_all/mac. I just copy them to my host. In fact, once I start up the VMware images, I do almost everything natively from the Mac.
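
For example, once the VMs are up, I connect straight from the Mac's Terminal with the copied act client and run a trivial sanity query. This is a sketch only: the flags and credentials below are from memory, so use whatever the Getting Started guide gives you:

    ./act -h 192.168.100.100 -U beehive -w beehive
    beehive=> SELECT 1;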

In future posts I plan to cover the following topics:
  • How to scale your Aster Express nCluster and make it more reliable
  • Demo: From Raw Web log to Business Insight
  • Demo: Finding the sentiment in Twitter messages
If there are topics you would like me to cover in the future, then just let me know.


Wednesday, October 24, 2012

My perspective on the Teradata Aster Big Analytics Appliance

Aster Big Analytics Appliance 3H

By now, no doubt you have heard the announcement of our new Teradata Aster Big Analytics Appliance. In the interests of full disclosure, I work for Teradata within the Aster CoE in Europe. Prior to joining Teradata, I was responsible for a large, complex Aster environment built on commodity servers, holding in excess of 30 TB of usable data and run as a 24x7 operational environment. So this post is written from that standpoint, recalling the time we went through a hardware refresh and software upgrade.

OK, first of all you procure 30 servers and at least two network switches (for redundancy). When you receive them, it is up to your data centre team to rack and cable them. Next, you check that the firmware on each system is the same; surprise, surprise, it isn't, so a round of upgrades later you configure the RAID controllers. In this case we went for RAID 0, which maximises space; more on that choice later...

Then it is over to the network team to configure the switches and the VLAN we are going to use. We then put a basic Linux image on the servers so we can carry out some burn-in tests, to make sure all the servers have a similar performance profile. Useful tests, as we found two servers whose RAID controllers were not configured correctly; the result of human error, I guess, since manually configuring 30 servers can get boring. This burns through a week before we can install Aster, configure the cluster and bring all the nodes online to start the data migration process. Agreed, this is a one-off cost, but in this environment we are responsible for all hardware, network and Linux issues, with the vendor supporting only Aster. Many customers never count that cost, or the outages that might have been avoided if these one-off configurations had been done right.

We had some basic systems management, as these were commodity servers, but nothing as sophisticated as Teradata Server Management and the Teradata Vital Infrastructure. I like that the Server Management software allows me to manage three clusters within the rack logically (e.g. Test, Production, Backup). I also like the proactive monitoring: issues are likely to be identified before they become an outage for us, and an issue found with one customer can be checked against all customers. If you build and manage your own environment, you don't get that benefit.

Your next consideration when looking at an appliance should be: is it leading edge, and how much thought has gone into the configuration? The Appliance 3H is a major step forward from Appliance 2. From a CPU perspective, it has the very latest processors from Intel: dual 8-core Sandy Bridge @ 2.6 GHz. Memory has increased to 256 GB. Connectivity between nodes is now provided by InfiniBand at 40 Gb/s. Disk drives are the newer 2.5-inch size, enabling more capacity per node: Worker nodes use 900 GB drives, while backup and Hadoop nodes leverage larger 3 TB drives, with RAID 5 for Aster and RAID 6 for the backup and Hadoop nodes. The larger cabinet size also enables better data centre utilisation through the higher density that is possible.

I also like the idea of providing integrated backup nodes. Previously Aster provided only the parallel backup software; you had to procure your own hardware and manage it. We also know that all of these components have been tested together, so I benefit from Teradata's extensive testing rather than building and testing reference configurations myself.

What this tells me is that Teradata can bring hardware advances to the marketplace quickly. InfiniBand will make an important difference. For example, joins on very large dimension tables that I have decided against replicating will run much faster. I also expect a positive impact on backups. In my previous environment, a full backup of 30 TB or so took us about 8 hours; that is roughly 1 GB/s sustained, so it is no surprise that the parallel nature of the backup software could soak up all the bandwidth of a 10 Gb connection, and we had to throttle it back. On the RAID choices, I absolutely concur with RAID 5; if I were building my own 30-node cluster again I wouldn't have it any other way. While the replication capabilities in Aster protect me against at least any single node failure, a disk failure will take that node out of the cluster until the disk is replaced and the node is rebuilt and brought back online. When you have 30+ servers, each with 8 drives (240+ disk drives), the most common failure will be a disk drive. With RAID 5 you can replace the drive without any impact on the cluster at all, and you still have the replication capabilities to protect yourself from multiple failures.

I also like the option of having a Hadoop cluster tightly integrated as part of my configuration. For example, if I have to store a lot of 'grey data' (log/audit files etc.) for compliance reasons, I can leverage a lower cost of storage and still do batch transformations and analysis as required, bringing a working set of data (the last year, for example) into Aster for deeper analytics. With the transparent capabilities of SQL-H, I can extend those analytics into my Hadoop environment as required.
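
To give a flavour of SQL-H, the sketch below queries a Hadoop table registered in HCatalog directly from Aster SQL. Treat it as an assumption-laden illustration: I am writing the load_from_hcatalog argument names from memory, and the server, database and table names are purely illustrative, so check the SQL-H documentation for the exact signature:

    SELECT COUNT(*)
    FROM load_from_hcatalog(
        ON mr_driver
        SERVER('hadoop-namenode')
        DBNAME('default')
        TABLENAME('weblogs')
    );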

Of course purchasing an appliance is a more expensive route than procuring, building and configuring it all yourself. However, most enterprises are not hobbyists; building this sort of infrastructure is not their core competence, nor does it bring value to their business. They should be focused on time to value, and with the Teradata Aster Big Analytics Appliance the time to value will be quick, as everything is prebuilt, configured, tested and ready to accept data and start performing analytics on it. As I talk to customers across Europe, this message is being well received when you talk through the details.

I'll leave you with this thought: one aspect of big data that I don't hear enough about is Value. To me, the dimensions of Volume, Variety, Velocity and Complexity are not very interesting if you are not providing value by means of actionable insights. I believe every enterprise needs a discovery platform capable of executing the analytics that can provide an important edge over the competition. This platform should handle structured as well as multi-structured data. It should provide a choice of analytics, whether SQL based, MapReduce or statistical functions, along with a host of prebuilt functions to enable rapid progress. It should appeal to power users in the business, with a SQL interface that works with their existing visualisation tools, as well as to the most sophisticated data scientists, giving them a rich environment to develop their own custom functions while benefiting from the power of both SQL and MapReduce. In summary, that is why I am so excited to be talking with customers and prospects about the Teradata Aster Big Analytics Appliance.

Thursday, October 11, 2012

Aster integration into the Teradata Analytical ecosystem continues at pace…

Not long after Teradata acquired Aster in April last year, we outlined a roadmap for how Aster would integrate into the Teradata Analytical ecosystem.

Aster-Teradata Adapter


Clearly, the first priority was to deliver a high-speed interconnect between Aster and Teradata. The Aster-Teradata adapter is based on the Teradata Parallel Transporter API and provides a high-speed link to transfer data between the two platforms. It allows parallel data transfers between Aster and Teradata, with each Aster Worker connecting to a Teradata AMP. This connector is part of the Aster SQL-MR library, with all data transfers initiated through Aster SQL.
The Aster-Teradata adapter offers fast and efficient data access. Users can build views in the Aster Database on tables stored in Teradata. Aster Database users can access and perform joins on Teradata-stored data as if it were stored in the Aster Database. Data scientists can run Aster's native analytic modules, such as nPath pattern matching, to explore data in the Teradata Integrated Data Warehouse. Users now have the capability to Investigate & Discover in Teradata Aster, then Integrate & Operationalize in the Data Warehouse.

The example below shows the load_from_teradata connector being called from within an Aster SQL query:

SELECT userid, age, sessionid, pageid
FROM nPath(
    ON (
        SELECT * FROM clicks,
            load_from_teradata(
                ON mr_driver
                TDPID('EDW')
                CREDENTIALS('tduser')
                QUERY('SELECT userid, age, gender, income FROM td_user_tbl;')
            ) T
        WHERE clicks.userid = T.userid
    )
    PARTITION BY userid, sessionid
    RESULT ( FIRST(age OF A) AS age, … )
)
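
Because load_from_teradata is called from plain SQL, it can also be wrapped in a view, which is how the "views on tables stored in Teradata" capability mentioned above is surfaced to analysts. A minimal sketch reusing the connector syntax from the example (the view name is illustrative):

    CREATE VIEW td_users AS
    SELECT *
    FROM load_from_teradata(
        ON mr_driver
        TDPID('EDW')
        CREDENTIALS('tduser')
        QUERY('SELECT userid, age, gender, income FROM td_user_tbl;')
    );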

The next example shows the load_to_teradata connector, with analytic processing on Aster and the results being sent to a Teradata target table:

SELECT *
FROM load_to_teradata(
    ON ( aster_target_customers )
    TDPID('dbc')
    USERNAME('tduser')
    PASSWORD('tdpassword')
    TARGET_TABLE('td_target_customer_table')
);

Viewpoint Integration

We have just announced Aster integration with Teradata Viewpoint, with release 14.01 of Viewpoint and Aster 5.0. Viewpoint's single operational view (SOV) monitoring has been extended to include support for Teradata Aster. Teradata wanted to leverage as many of the existing portlets as possible, for easier navigation, familiarity, and support for the Viewpoint SOV strategy. So the following existing portlets were extended to include support for Aster:

  • System Health
  • Query Monitor
  • Capacity Heatmap
  • Metrics Graph
  • Metrics Analysis
  • Space Usage
  • Admin - Teradata Systems


However, not every aspect of Teradata Aster's differing architecture made sense to fold into an existing portlet, so there are two new Aster-specific portlets in this release:

  • Aster Completed Processes
  • Aster Node Monitor

Some screenshots:

[Screenshots of the extended and new Aster portlets appeared here.]
Check out the complete article on Viewpoint 14.01 release at the Teradata Developer Exchange here.
Stay tuned for some very exciting news coming next week...



Friday, October 05, 2012

Don't be left in the dark... Exciting Teradata Aster Announcement coming soon



We're gearing up for a big Teradata Aster announcement. Mark your calendars to get the scoop on Oct 17th: we will be hosting a live webinar to announce it. Register for it here.

Believe me, it is going to be amazing...

Wednesday, October 03, 2012

The Big Data Analytics Landscape

As a technologist and evangelist, working in the big data marketplace is certainly exciting. I am excited by the new products we are bringing to market and how this new functionality helps bridge the gap for enterprise adoption. It is also surreal in terms of the sheer number of blog posts and tweets on big data, and there seems to be a new big data conference cropping up on a weekly basis across Europe :-)

It is interesting to monitor other vendors in the marketplace and how they position their offerings. There is certainly a lot of clever marketing going on (which I believe in time will show a lack of substance), and some innovation too. You know who you are... everyone jumping aboard the bandwagon. Just because you have Hadoop and your database plus analytics within the same rack, that doesn't mean they are integrated.

It is also interesting, the fervour that people bring when discussing open source products. Those who know me know that I am a long-time UNIX guru, with over 20 years from the early BSD distributions, to working with UNIX on mainframes, to getting Minix to work on PCs before Linux came along. Fun times, but I was young, free and single, enjoyed the technical challenge, and was working in a research department in a university. However, the argument that Hadoop is free and easy to implement and will one day replace data warehousing doesn't ring true for me. Certainly it has a place and does provide value, but it doesn't come at no cost. Hortonworks and Cloudera provide distributions that are reducing the installation, configuration and management effort, but now you have multiple distributions starting to go in different directions; MapR, for example.

How many enterprises really want to get that involved in running and maintaining this infrastructure? Surely they should be focused on identifying new insights that provide business benefit or greater competitive advantage. IT has an important role to play, but it will ultimately be the business users who need to leverage the platform to gain these insights.

It is no use getting insights if you don't take action on them, either.

Insights gained from big data analytics should be fed into the existing EDW (where one exists) so they enhance what you already have, and the EDW then provides a better means of operationalizing the results.

I say to those people who think Hive is a replacement for SQL: not yet it ain't. It doesn't provide the completeness or performance that a pure SQL engine can provide. You don't replace 30+ years of R&D that quickly...

To the NoSQL folks: this debate takes on religious fervour at times. NoSQL has a role, but I don't see it replacing the relational database overnight either.

In a previous role I managed a complex database environment, including a big data platform, for a company in the online gaming marketplace: very much a 24x7 environment with limited downtime, at the bleeding edge at times, and growing very fast. If we had had Teradata Aster 5.0 then, my life would have been so much easier. We had an earlier release, but we learned a lot. We proved the value of SQL combined with the MapReduce programming paradigm. We saw the ease of scaling and the reliability. We delivered important insights into various types of fraud, and took action on them, which yielded positive kudos for the company and increased player trust, which is very important in an online marketplace. We were also able to leverage the platform for a novel ODS requirement, with both workloads executing simultaneously along with various ad-hoc queries. I was also lucky, then and since, to meet real visionaries like Mayank and Tasso, which gives you confidence in the approach and the future direction.

When you think of big data analytics, it is not just about multi-structured data or new data sources. Using SQL-MapReduce, for example, may be the most performant way to yield new insights from existing relational data. Also consider what 'grey data' already exists within your organisation; it may be easier to tap into that first, before sourcing new data feeds. The potential business value should drive that decision, though.
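
To make that concrete, here is a sketch of the kind of SQL-MR call I mean, using Aster's sessionize function to cut an ordinary relational clickstream table into sessions in a single pass; the table and column names are illustrative and the exact argument list may vary by release:

    SELECT *
    FROM SESSIONIZE(
        ON clicks
        PARTITION BY userid
        ORDER BY clicktime
        TIMECOLUMN('clicktime')
        TIMEOUT(1800)
    );

Doing the same in pure SQL needs awkward self-joins or window-function gymnastics, which is exactly the gap SQL-MR fills.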

Do not underestimate the importance of having a discovery platform as you tackle these new big data challenges. Yes, you will probably need new people, or even better, train existing analysts to take on these new skills and grow your own data scientists. The ease of this approach will depend on how feature-rich your discovery platform is: how many built-in, useful analytical functions are provided to get you started before you have to develop specific ones of your own.

I suppose some would say I am rambling with these comments and not expressing them very elegantly, but help is at hand :-). We recently put together a short webinar; it is about 20 minutes in duration.

The Big Data Analytics Landscape: Trends, Innovations and New Business Value features Gartner Research Vice President Merv Adrian and Teradata Aster Co-President Tasso Argyros. In the video, Merv and Tasso answer these questions and more, including how organizations can find the right solution to make smarter decisions, take calculated risks, and gain deeper insights than their industry peers:

  • How do you cost-effectively harness and analyze new big data sources?
  • How does the role of a data scientist differ from other analytic professionals?  
  • What skills does the data scientist need?
  • What are the differences between Hadoop, MapReduce, and a Data Discovery Platform?
  • How are these new sources of big data and analytic techniques and technology helping organizations find new truths and business opportunities?
I suggest, if you have the time to spare... watch the video.

What do you think?

For me it is all about the analytics, and the new insights that can be gained and acted upon.

Saturday, June 09, 2012

First Steps in Exploring Social Media Analytics

As I talk with customers and colleagues the topic of social media analytics is often featured.  Some customers have already got a strategy defined and are executing to a plan while others are at a more nascent stage but believe in the potential to have a direct and almost immediate connection to their customers.

I'll admit that I am somewhat of a social media novice, so it will be a learning experience for me too, and I am intrigued by the depth of analytics that may be possible. Only this month did I set up my profile on Facebook, and I'm 47! I have been using Twitter more regularly of late, since our Teradata Universe conference, and I probably look at it 3 or 4 times a day, depending on what's on my schedule. I am finding some interesting, funny and informative updates each day as I slowly expand the number of people I follow; it is a mixture of friends and work-related contacts at the moment. I have been a member of LinkedIn for a number of years and find it a useful resource from a professional perspective; I am within the first 1% of members who subscribed to the site (I received an email to the effect that I was within the first 1 million sign-ups when the site hit 100 million). Finally, I am keen to blog more frequently when I have something interesting to share (i.e. this! :-)), though I had stopped blogging for about 5 years at one point. I have also started with Flickr and YouTube. I'll be my own guinea pig in some ways as I explore and experiment with possibly useful analytics in these social media channels.

However, when most people think of social media and the associated analytics, Facebook and Twitter are often mentioned first. These systems do provide APIs that can be used to readily access data, and they split into two broad categories that reflect each network's attitude to customer data.

The first approach is the open one adopted by Twitter: users are warned that their posts are visible to anyone (who can find them). The second is that adopted by Facebook: there is an extensive privacy model, and data needs to be accessed using authorizations (from likes and games).

Twitter
  • Data is free but there is a lot of it
  • Identifying relevant stuff isn’t easy
  • History isn’t usually available

Facebook
  • Public data is free through search
  • Basic data available through application
  • More sophisticated data available with permissions (using OAuth 2.0)
 
Focusing on Facebook and Twitter, you see two very different levels of information: Twitter provides only basic, row-level data, while Facebook provides much more complex, relational data. We'll explore these in more detail in future posts.

Data from social media must be linked in three ways:
  • Within the social media itself
  • Across multiple social media
  • Between social media and the bank's core data

The most secure forms of linking use unique references: email addresses, IP addresses and telephone numbers. This can be supported by direct access methods (i.e. asking the user for their Twitter name, or persuading them to Like the bank on Facebook from within a known environment).

However, even then the confidence in the link must be evaluated and recorded: this information is user-provided and may be wrong in some cases. The notion of a "soft match" should be adopted: we think that this is the same person, but we cannot be sure.
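
As an illustration of the soft-match idea (every table and column name here is hypothetical), you can record each candidate link with its matching method and a confidence score, rather than stamping the social identity directly onto the customer record; the DISTRIBUTE BY clause is Aster's table distribution syntax:

    CREATE TABLE social_identity_link (
        customer_id    BIGINT,
        channel        VARCHAR(20),   -- e.g. 'twitter', 'facebook'
        social_handle  VARCHAR(100),
        match_method   VARCHAR(20),   -- e.g. 'email', 'user_provided'
        confidence     NUMERIC(3,2)   -- 1.00 = hard match; lower = soft match
    ) DISTRIBUTE BY HASH(customer_id);

    -- A soft match on email address, recorded with reduced confidence:
    INSERT INTO social_identity_link
    SELECT c.customer_id, 'twitter', t.handle, 'email', 0.80
    FROM customers c
    JOIN twitter_profiles t ON lower(c.email) = lower(t.email);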

I would like to end this post with a recommendation to read the following white paper by John Lovett from Web Analytics Demystified: Beyond Surface-Level Social Media. Lovett, who has written a book on social analytics, lays out a compelling vision of deeper social analytics for companies. He clearly presents the value for companies going beyond surface-level analytics of likes, followers and friends, and challenges you to ask deeper and more important questions. This white paper was sponsored by Teradata Aster and is available for free from here.

In reading this white paper you will gain an understanding of the term 'Surface-Level Social Media' coined by John, and how it is possible to gain competitive advantage even operating at this level. He outlines how Generation-Next Marketing is being powered by social analytics, backed up with a number of interesting customer examples, goes on to lay out a 7-point strategy to build your deeper social media strategy, and concludes with how unstructured data can yield valuable customer intelligence.

I found it very informative and well written; it gave me a number of new insights and points to ponder. I would be interested in your thoughts on it too.

Enjoy!

Wednesday, May 23, 2012

Upcoming Webcast: Bridging the Gap - SQL and MapReduce for Big Analytics



On Tuesday May 29th, Teradata Aster will host a webcast, Bridging the Gap: SQL and MapReduce for Big Analytics. Expected duration is 60 minutes, starting at 15:00 CET (Paris, Frankfurt) / 14:00 UTC (London). You can register for free here.

We ran this seminar earlier in May, but at a time more convenient for a US audience. It was well attended, and we received good feedback from attendees, which encouraged us to rerun it with some minor changes at a time more convenient for people in Europe.

If you are considering a big data strategy, are confused by all the hype out there, or believe that MapReduce = Hadoop or Hive = SQL, then this is an ideal event for a business user: a summary of the key challenges, the sorts of solutions that are out there, and the novel and innovative approach Teradata Aster has taken to maximise time to value for companies considering their first big data initiatives.

I will be the moderator for the event and will introduce Rick F. van der Lans, Managing Director of R20/Consultancy, based in the Netherlands. Rick is an independent analyst, consultant, author and lecturer specializing in data warehousing, business intelligence, service-oriented architectures, and database technology. He will be followed by Christopher Hillman from Teradata. Chris is based in the United Kingdom and recently joined us as a Principal Data Scientist. We will have time at the end to address questions from attendees.

During the session we will discuss the following topics:

  • Understanding MapReduce vs SQL, UDFs, and other analytic techniques
  • How SQL developers and business analysts can become "data scientists"
  • Fitting MapReduce into your BI/DW technology stack
  • Making the power of MapReduce available to the larger business community


So come join us on May 29th. It will be an hour of your time well invested. Register for free here.