Wednesday, 9 May 2012

Digg is not logging in with Chrome

It seems all the giants have started to run into small problems as they grow rapidly in size.
After many issues with Twitter, Digg now follows. When I tried to log in just now, I got an error message, which I captured for the record.

The problem happens with Google Chrome version 18.0.1025.168 m, but login works fine with Internet Explorer (I hate Internet Explorer).


Google Drive officially released Today 9 May 2012

Google Drive is a new service from Google meant to kill Dropbox. It is free for the first 5GB, and you can upgrade to 25GB for $2.50 a month. They say the service is available for PCs, Macs, Android devices, and soon iOS devices.

According to Mercury News, "... the success of Drive will ride largely on whether Google can differentiate its offering from already established fast-growing cloud storage startups that were in the market first, such as Dropbox and Box, as well as Microsoft's SkyDrive service and big consumer media competitors like Apple's iCloud and Amazon's Cloud Drive. ... Existing Google Docs files, the centerpiece of Google's existing cloud storage offering, will move to the Google Drive service once users download apps and install the new service."

Google Drive URL could be found h3r3

Project Sputnik

Project Sputnik has been revealed by Dell's Web vertical. It is a stylish laptop tailored to the needs of developers at web companies.
Quoted:
 "'We want to find ways to make the developer experience as powerful and simple as possible. And what better way to do that than beginning with a laptop that is both highly mobile and extremely stylish, running the 12.04 LTS release of Ubuntu Linux', George ponders, and gives a quick list of packages that the default installation could include. The machine will be based on the XPS13, assessing a couple of its main hardware deficiencies along the way."

The Sputnik install image, which is based on the Ubuntu 12.04 LTS release, contains:
  1. drivers/patches for Hardware enablement
  2. a basic offering of key tools and utilities (see the complete list at the end of this entry)
  3. coming soon, a software management tool to go out to a github repository to pull down various developer profiles.

Monday, 5 March 2012

MIT App Inventor Open Beta Preview

App Inventor from MIT lets non-developers (or, better said, anyone who wants fast development) build applications for Android.
Link is h3r3

Details as per MIT are below:

The MIT Center for Mobile Learning is delighted to announce that we’re meeting our goal of making MIT App Inventor available as a public service in the first quarter of 2012.
For the past two months, we have been conducting a closed test of the system for an increasing number of testers, and we’ve currently scaled to 5000 testers. Today, we’re taking the next step, and opening the MIT App Inventor service to everyone. All you will need is a Google ID for log-in (for example, a Gmail account).
App Inventor will now be suitable for any use, including running classes. But please be aware that this is the first time the system will be under load from a large number of users, so there may be bumps and adjustments as the load increases. For now, we suggest that you maintain backup copies of important apps, as we see how things go.
Of course, there are glitches and minor errors and lots of room for improvement. We’ll be turning our attention to these improvements, once we have more experience with running the system at scale. We will also be developing more resources and support for using App Inventor as a learning tool. We look forward to working with you over the coming months to build the community of App Inventor educators.
We owe a large debt to our testers of the past few months; it’s been their feedback that’s given us the confidence for today’s announcement. And we’re tremendously grateful to the folks who have been running their own system with the MIT JAR files. Their experiences have been an invaluable source of information, and their work has been critical in keeping App Inventor alive while the MIT service was not yet available. We also want to acknowledge the growing group of developers who are starting to explore the App Inventor source code. They are the seeds of an open source community that we hope will take App Inventor beyond anything we could do by ourselves at MIT. And our extreme gratitude and admiration goes to the Google App Inventor team who, even while their project transitions out of Google, have continued to share their expertise and the fruit of their hard work of the past three years.
Please join with us in helping the system move to its next phase as an MIT service. You can learn about MIT App Inventor by visiting our new home at http://appinventor.mit.edu
 

Wednesday, 15 February 2012

Ulteo Open Virtual Desktop (OVD)

Ulteo Open Virtual Desktop (OVD) v3 is available now!
 "The new OVD v3 core architecture will allow us to add new features more rapidly and more easily. We are going to be as reactive as possible to customer feedback and projects. We will also offer the possibility to co-fund OVD extensions or additional features for medium and large project needs."

The Open Virtual Desktop delivers applications running on Linux and Window$ through a choice of end-user interfaces.

Screenshots : Link is h3r3
 
System Requirements:
- OVD Application servers: x86 servers w/multi-core or quad CPU. 1GB or more RAM per 15 concurrent end-users. Supported Host OS: Ubuntu 10.04.*, RHEL 5.2,5.3, Centos 5.2+, Fedora 10, OpenSuSE 11.2, generic Linux install.

- Servers for Windows applications: Windows 2003, 2008 or 2008R2 (64bit) Server + Active Directory and Terminal Services on any hardware.

- Servers for OVD Session Manager: x86 server w/512MB or more RAM. Host OS: Ubuntu 10.04.*, RHEL 5.2, 5.3, Centos 5.2+, Fedora 10, OpenSuSE 11.2, generic Linux install.

- Web Client: Oracle Java 1.6 enabled browser: Firefox 2+, Internet Explorer 7+, any platform. Safari 5+ on MacOS.
- Native Client: Windows® XP/7 or Linux platform with Oracle Java JRE 1.6 (optional)
- iOS 4.3/5.x client for iPad/iPhone
- Android Client for tablets and phones
- Network: 10Mbps or more LAN
- User directory servers: Active Directory, e-Directory/ZenWork and LDAP servers are currently supported.
- Fileservers: CIFS (Linux and/or Windows), embedded WebDAV server
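
As a quick sanity check on the application server requirement above (1GB or more RAM per 15 concurrent end-users), here is a rough sizing sketch. The function name and the choice to round up to whole gigabytes are my own assumptions, not something from Ulteo, and the result is only the stated minimum: it ignores whatever the host OS and the applications themselves need.

```python
import math

# From the requirement above: 1GB or more RAM per 15 concurrent end-users.
USERS_PER_GB = 15


def min_app_server_ram_gb(concurrent_users):
    """Return the stated minimum RAM (in whole GB) for a number of concurrent users.

    Hypothetical helper for illustration only; rounds up to the next full GB.
    """
    if concurrent_users < 0:
        raise ValueError("concurrent_users must not be negative")
    return math.ceil(concurrent_users / USERS_PER_GB)


if __name__ == "__main__":
    for users in (15, 16, 100):
        print(users, "users ->", min_app_server_ram_gb(users), "GB minimum")
```

Treat the number as a floor when planning, since the requirement itself says "or more".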


Link to Ulteo, the Open Virtual Desktop platform, is h3r3

Open Source Cloud Computing

CloudStack is an open source cloud computing project: an infrastructure-as-a-service (IaaS) platform that orchestrates virtualized servers into an elastic compute environment. The project was originally developed by Cloud.com and has been sponsored by Citrix since it acquired Cloud.com in July 2011.

CloudStack provides multiple methods for interacting with the CloudStack compute platform. Users can request resources through a rich menu-driven web interface. Operations personnel can use an enhanced version of the web interface or interact with CloudStack’s RESTful API or command line interface (CLI). The new 3.0 UI takes things up a notch making it very intuitive for users to administer their own cloud computing so administrators can delegate infrastructure provisioning and focus on more high value tasks than spinning up servers.
Another thing that I think sets CloudStack apart is its networking-as-a-service capabilities. A CloudStack administrator can create any number of custom network offerings in addition to the default network offerings provided by CloudStack. These offerings can be attached to the virtualized machines deployed by CloudStack. CloudStack allows users to choose the type of network architecture that best fits their needs. Out-of-the-box support includes the Basic Network (flat network) mode, or advanced networking with VLAN support and integration of network elements including external firewalls and load balancers. Administrators can offer different classes of service on a single multi-tenant physical network with a combination of networking offerings that include DHCP, Source Network Address Translation (NAT), Gateway, Load Balancing, Firewall, VPN, Port Forwarding.
You can get the details on the beta of CloudStack 3.0 from the CloudStack open source project and the GA version should be available in the upcoming weeks.

Enjoy the Cloud the Right Flavour, the Open Source Flavour

Dashboard  Screenshot

Tuesday, 8 November 2011

Microsoft contributes open-source code to Samba

Microsoft has contributed source code under the GPLv3 to Samba, the file server software that enables Linux servers to share files with Windows PCs.
Freak snowstorm reported in hell. Tea party agrees Obama is the best candidate for 2012 presidential election. Microsoft submits open-source code under the GPLv3 to Samba. Those are all pretty unlikely, but Microsoft really did submit code to the Samba file server open-source project.
This might not strike you as too amazing. After all, Microsoft has supported some open-source projects at CodePlex for some time now and they will work with some other projects such as the Python and PHP languages and the Drupal content management system (CMS). But, Samba, Samba is different. They’re an old Microsoft enemy.
Samba, itself, is a set of Windows interoperability programs that provide secure, stable and fast file and print services for all client operating systems that use the Server Message Block (SMB)/Common Internet File System (SMB/CIFS) protocol. As such Samba is used to seamlessly integrate Linux/Unix servers and desktops into Active Directory (AD) networks using the Winbind daemon. In common usage, Samba is on almost every network attached storage (NAS) device that ships today. In short, Samba enables Linux to rival Windows Server on workgroups.
In fact, it was Samba on Linux that took Linux from being an edge server, used for Web serving and e-mail, to being an infrastructure server. With Samba, Linux delivers the bread and butter of file and print serving that every business needs in millions of companies.
Since Samba began in 1992, Microsoft has been, well, less than happy with its server rival. But ever since Microsoft lost an anti-trust case in the European Union and was forced to open its network protocols to Samba in 2007, Microsoft has ever so slowly been getting along better with Samba.
But even so, it came as a surprise when, on October 10th, Stephen A. Zarko of Microsoft’s Open Source Technology Center gave Samba some proof of concept code for extended protection (channel and service binding) for Firefox and Samba for NT LAN Manager (NTLM) authentication. That’s one small step for open source, one giant leap for Samba/Windows interoperability.
As Chris Hertel of the Samba Team wrote, “A few years back, a patch submission from coders at Microsoft would have been amazing to the point of unthinkable, but the battles are mostly over and times have changed. We still disagree on some things such as the role of software patents in preventing the creation of innovative software; but Microsoft is now at the forefront of efforts to build a stronger community and improve interoperability in the SMB world.”
Hertel continued, “Most people didn’t even notice the source of the contribution. That’s how far things have come in the past four-ish years. …but some of us saw this as a milestone, and wanted to make a point of expressing our appreciation for the patch and the changes we have seen.”
Jeremy Allison, one of Samba’s leaders and a software engineer at Google Open Source Programs Office, told me that he was “really pleased. It does show that Microsoft now consider us part of the landscape they inhabit, and cooperating with us is a really good sign that engineering-wise they understand Free Software/Open Source is a really good thing that can help them also (not to put words in their mouth, but I think recent work from them on Hadoop [An Apache open-source framework for reliable, scalable, distributed computing] and others have shown this).”
That said, “Sending code to Samba is a big deal due to historical legacy of the EU lawsuit, and shows that Microsoft is becoming a mature member of the OSS [open source software] ecosystem,” said Allison.
He continued, “Now if they’d only stop threatening OSS over patents, and just tried to make money with it the same way everyone else does by building it into products (they’re nearly there I think), I think we could finally bury the hatchet :-).”
“But,” Allison concluded, “I want to be fair to the guys who sent the patch, that’s another department in Microsoft (the one who is suing people :-). These guys are in the OSS-lab in Microsoft and they’re great!”
OK, I was wrong. One amazing thing hasn’t happened. Two amazing things have happened. First, Microsoft has contributed code of its own free will to a former enemy, Samba. And, second, one of Samba’s leaders and a well-known champion of open-source software is saying that people at Microsoft are great. It’s a day of miracles!

Oracle is taking everything from SUN, or maybe erasing it

I feel pity as I watch ORACLE erase SUN from the map. I have noticed the following:
1) Blogs on the SUN domain are now redirected to ORACLE
2) The SSL certificate has been replaced by Oracle SSL certificates, which are not even configured properly
Check the pics below
Good Bye SUN, I will miss You
When you go to blogs.sun.com, you find yourself receiving an Oracle wildcard SSL certificate



Friday, 4 November 2011

What is Hadoop?

Hadoop is an open source project for data mining, now hosted by the Apache Software Foundation. If you look at the names using it for data mining, you will see the power of this project.

The Apache™ Hadoop™ project develops open-source software for reliable, scalable, distributed computing.

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.
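
To make the "simple programming model" a bit more concrete, here is a minimal sketch of the map, shuffle and reduce steps in plain Python, using the classic word count. It does not use Hadoop at all: the function names (map_phase, shuffle_phase, reduce_phase, word_count) are my own illustration, and everything runs in a single process instead of being spread across a cluster.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key, as the framework would do."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reducer: collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in-process on a list of text lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sample = ["Hadoop scales out", "hadoop handles failures"]
    print(word_count(sample))
```

On a real cluster the mappers and reducers would run on different machines and the framework would take care of the grouping step and of re-running failed tasks; the sketch only shows how the work is split.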

Project page is h3r3


Names/companies using the Hadoop project (ref link is h3r3)

A


  • A9.com - Amazon*
    • We build Amazon's product search indices using the streaming API and pre-existing C++, Perl, and Python tools.
    • We process millions of sessions daily for analytics, using both the Java and streaming APIs.
    • Our clusters vary from 1 to 100 nodes
    • We use a Hadoop cluster to rollup registration and view data each night.
    • Our cluster has 10 1U servers, with 4 cores, 4GB ram and 3 drives
    • Each night, we run 112 Hadoop jobs
    • It is roughly 4X faster to export the transaction tables from each of our reporting databases, transfer the data to the cluster, perform the rollups, then import back into the databases than to perform the same rollups in the database.
    • We use Hadoop and HBase in several areas from social services to structured data storage and processing for internal use.
    • We currently have about 30 nodes running HDFS, Hadoop and HBase in clusters ranging from 5 to 14 nodes on both production and development. We plan a deployment on an 80 nodes cluster.
    • We constantly write data to HBase and run MapReduce jobs to process then store it back to HBase or external systems.
    • Our production cluster has been running since Oct 2008.
    • We use Flume, Hadoop and Pig for log storage and report generation as well as ad targeting.
    • We currently have 12 nodes running HDFS and Pig and plan to add more from time to time.
    • 50% of our recommender system is pure Pig because of its ease of use.
    • Some of our more deeply-integrated tasks are using the streaming API and Ruby as well as the excellent Wukong library.
  • Able Grape - Vertical search engine for trustworthy wine information
    • We have one of the world's smaller hadoop clusters (2 nodes @ 8 CPUs/node)
    • Hadoop and Nutch used to analyze and index textual information
  • Adknowledge - Ad network
    • Hadoop used to build the recommender system for behavioral targeting, plus other clickstream analytics
    • We handle 500MM clickstream events per day
    • Our clusters vary from 50 to 200 nodes, mostly on EC2.
    • Investigating use of R clusters atop Hadoop for statistical analysis and modeling at scale.
  • Aguja - E-Commerce Data analysis
    • We use hadoop, pig and hbase to analyze search log, product view data, and analyze all of our logs
    • 3 node cluster with 48 cores in total, 4GB RAM and 1 TB storage each.
    • A 15-node cluster dedicated to processing sorts of business data dumped out of database and joining them together. These data will then be fed into iSearch, our vertical search engine.
    • Each node has 8 cores, 16G RAM and 1.4T storage.
  • AOL
    • We use hadoop for variety of things ranging from ETL style processing and statistics generation to running advanced algorithms for doing behavioral analysis and targeting.
    • The Cluster that we use for mainly behavioral analysis and targeting has 150 machines, Intel Xeon, dual processors, dual core, each with 16GB Ram and 800 GB hard-disk.
  • ARA.COM.TR - Ara Com Tr - Turkey's first and only search engine
    • We build Ara.com.tr search engine using the Python tools.
    • We use Hadoop for analytics.
    • We handle about 400TB per month
    • Our clusters vary from 10 to 100 nodes
    • We use hadoop for information extraction & search, and data analysis consulting
    • Cluster: we primarily use Amazon's Elastic MapReduce

B


    • 4 nodes cluster (32 cores, 1TB).
    • We use Hadoop for searching and analysis of millions of rental bookings.
  • Baidu - the leading Chinese language search engine
    • Hadoop used to analyze the log of search and do some mining work on web page database
    • We handle about 3000TB per week
    • Our clusters vary from 10 to 500 nodes
    • Hypertable is also supported by Baidu
    • 14 node cluster (each node has: 2 dual core CPUs, 2TB storage, 8GB RAM)
    • We use hadoop for matching dating profiles
  • Benipal Technologies - Outsourcing, Consulting, Innovation
    • 35 Node Cluster (Core2Quad Q9400 Processor, 4-8 GB RAM, 500 GB HDD)
    • Largest Data Node with Xeon E5420*2 Processors, 64GB RAM, 3.5 TB HDD
    • Total Cluster capacity of around 20 TB on a gigabit network with failover and redundancy
    • Hadoop is used for internal data crunching, application development, testing and getting around I/O limitations
  • Bixo Labs - Elastic web mining
    • The Bixolabs elastic web mining platform uses Hadoop + Cascading to quickly build scalable web mining applications.
    • We're doing a 200M page/5TB crawl as part of the public terabyte dataset project.
    • This runs as a 20 machine Elastic MapReduce cluster.
  • BrainPad - Data mining and analysis
    • We use Hadoop to summarize users' tracking data.
    • We also use it for analysis.
  • Brockmann Consult GmbH - Environmental informatics and Geoinformation services
    • We use Hadoop to develop the Calvalus system - parallel processing of large amounts of satellite data.
    • Focus on generation, analysis and validation of environmental Earth Observation data products.
    • Our cluster is a rack with 20 nodes (4 cores, 4 GB RAM each),
    • 112 TB diskspace total.

C


    • Hardware: 15 nodes
    • We use Hadoop to process company and job data and run Machine learning algorithms for our recommendation engine.
    • We use Hadoop for our internal searching, filtering and indexing
    • Hardware: 15 nodes
    • We use Hadoop to process company and job data and run Machine learning algorithms for our recommendation engine.
    • Used on client projects and internal log reporting/parsing systems designed to scale to infinity and beyond.
    • Client project: Amazon S3-backed, web-wide analytics platform
    • Internal: cross-architecture event log aggregation & processing
  • Contextweb - Ad Exchange
    • We use Hadoop to store ad serving logs and use it as a source for ad optimizations, analytics, reporting and machine learning.
    • Currently we have a 50 machine cluster with 400 cores and about 140TB raw storage. Each (commodity) node has 8 cores and 16GB of RAM.
  • Cooliris - Cooliris transforms your browser into a lightning fast, cinematic way to browse photos and videos, both online and on your hard drive.
    • We have a 15-node Hadoop cluster where each machine has 8 cores, 8 GB ram, and 3-4 TB of storage.
    • We use Hadoop for all of our analytics, and we use Pig to allow PMs and non-engineers the freedom to query the data in an ad-hoc manner.
    • Generating web graphs on 100 nodes (dual 2.4GHz Xeon Processor, 2 GB RAM, 72GB Hard Drive)
    • Crowdmedia has a 5 Node Hadoop cluster for statistical analysis
    • We use Hadoop to analyse trends on Facebook and other social networks

D


    • We use Hadoop for batch-processing large RDF datasets, in particular for indexing RDF data.
    • We also use Hadoop for executing long-running offline SPARQL queries for clients.
    • We use Amazon S3 and Cassandra to store input RDF datasets and output files.
    • We've developed RDFgrid, a Ruby framework for map/reduce-based processing of RDF data.
    • We primarily use Ruby, RDF.rb and RDFgrid to process RDF data with Hadoop Streaming.
    • We primarily run Hadoop jobs on Amazon Elastic MapReduce, with cluster sizes of 1 to 20 nodes depending on the size of the dataset (hundreds of millions to billions of RDF statements).
    • Elastic cluster with 5-80 nodes
    • We use hadoop to create our indexes of deep web content and to provide a high availability and high bandwidth storage service for index shards for our search cluster.
    • We are using Hadoop in our data mining and multimedia/internet research groups.
    • 3 node cluster with 48 cores in total, 4GB RAM and 1 TB storage each.
  • Detikcom - Indonesia's largest news portal
    • We use hadoop, pig and hbase to analyze search log, generate Most View News, generate top wordcloud, and analyze all of our logs
    • Currently We use 9 nodes
    • We generate Pig Latin scripts that describe structural and semantic conversions between data contexts
    • We use Hadoop to execute these scripts for production-level deployments
    • Eliminates the need for explicit data and schema mappings during database integration

E


    • 532 nodes cluster (8 * 532 cores, 5.3PB).
    • Heavy usage of Java MapReduce, Pig, Hive, HBase
    • Using it for Search optimization and Research.
  • Enet, 'Eleftherotypia' newspaper, Greece
    • Experimental installation - storage for logs and digital assets
    • Currently 5 nodes cluster
    • Using hadoop for log analysis/data mining/machine learning
    • 4 nodes cluster (32 cores, 1TB).
    • We use Hadoop to filter and index our listings, removing exact duplicates and grouping similar ones.
    • We plan to use Pig very shortly to produce statistics.
    • 4 nodes proof-of-concept cluster.
    • We use Hadoop in a Data-Intensive Computing capstone course. The course projects cover topics like information retrieval, machine learning, social network analysis, business intelligence, and network security.
    • The students use on-demand clusters launched using Amazon's EC2 and EMR services, thanks to its AWS in Education program.
    • We are using Hadoop in a course that we are currently teaching: "Massively Parallel Data Analysis with MapReduce". The course projects are based on real use-cases from biological data analysis.
    • Cluster hardware: 16 x (Quad-core Intel Xeon, 8GB RAM, 1.5 TB Hard-Disk)
  • Eyealike - Visual Media Search Platform
    • Facial similarity and recognition across large datasets.
    • Image content based advertising and auto-tagging for social media.
    • Image based video copyright protection.

F


    • We use Hadoop to store copies of internal log and dimension data sources and use it as a source for reporting/analytics and machine learning.
    • Currently we have 2 major clusters:
      • A 1100-machine cluster with 8800 cores and about 12 PB raw storage.
      • A 300-machine cluster with 2400 cores and about 3 PB raw storage.
      • Each (commodity) node has 8 cores and 12 TB of storage.
      • We are heavy users of both streaming as well as the Java apis. We have built a higher level data warehousing framework using these features called Hive (see the http://hadoop.apache.org/hive/). We have also developed a FUSE implementation over hdfs.
    • 40 machine cluster (8 cores/machine, 2TB/machine storage)
    • 70 machine cluster (8 cores/machine, 3TB/machine storage)
    • 30 machine cluster (8 cores/machine, 4TB/machine storage)
    • Use for log analysis, data mining and machine learning
    • 5 machine cluster (8 cores/machine, 5TB/machine storage)
    • Existing 19 virtual machine cluster (2 cores/machine 30TB storage)
    • Predominantly Hive and Streaming API based jobs (~20,000 jobs a week) using our Ruby library, or see the canonical WordCount example.
    • Daily batch ETL with a slightly modified clojure-hadoop
    • Log analysis
    • Data mining
    • Machine learning
  • Freestylers - Image retrieval engine
    • We, the Japanese company Freestylers, have used Hadoop since April 2009 to build the image processing environment for our image-based product recommendation system, mainly on Amazon EC2.
    • Our Hadoop environment produces the original database for fast access from our web application.
    • We also use Hadoop to analyze similarities in users' behavior.

G


H


  • Hadoop Korean User Group, a Korean Local Community Team Page.
    • 50 node cluster in the Korea university network environment.
      • Pentium 4 PC, HDFS 4TB Storage
    • Used for development projects
      • Retrieving and Analyzing Biomedical Knowledge
      • Latent Semantic Analysis, Collaborative Filtering
    • 3 machine cluster (4 cores/machine, 2TB/machine)
    • Hadoop for data for search and aggregation
    • Hbase hosting
    • 13 machine cluster (8 cores/machine, 4TB/machine)
    • Log storage and analysis
    • Hbase hosting
    • 6 node cluster (each node has: 4 dual core CPUs, 1,5TB storage, 4GB RAM, RedHat OS)
    • Using Hadoop for our high speed data mining applications in cooperation with Online Scheidung
    • Euribor trend and current value
    • Mortgage simulator for the economic crisis
    • We use a customised version of Hadoop and Nutch in a currently experimental 6 node/Dual Core cluster environment.
    • We crawl our clients' websites and, from the information we gather, fingerprint old and non-updated software packages in that shared hosting environment. After matching a signature against a database, we can inform our clients that they have old and non-updated software running. With that information we know which sites would require patching, as a free courtesy service to protect the majority of users. Without the technologies of Nutch and Hadoop this would be far harder to accomplish.

I


  • IBM
    • We are using Hadoop and Nutch to crawl Blog posts and later process them. Hadoop is also beginning to be used in our teaching and general research activities on natural language processing and machine learning.
    • We use hadoop for Information Retrieval and Extraction research projects. Also working on map-reduce scheduling research for multi-job environments.
    • Our cluster sizes vary from 10 to 30 nodes, depending on the jobs. Heterogeneous nodes with most being Quad 6600s, 4GB RAM and 1TB disk per node. Also some nodes with dual core and single core configurations.
    • From TechCrunch:
      • Rather than put ads in or around the images it hosts, Levin is working on harnessing all the data his service generates about content consumption (perhaps to better target advertising on ImageShack or to syndicate that targeting data to ad networks). Like Google and Yahoo, he is deploying the open-source Hadoop software to create a massive distributed supercomputer, but he is using it to analyze all the data he is collecting.
    • We use Hadoop to analyze our virtual economy
    • We also use Hive to access our trove of operational data to inform product development decisions around improving user experience and retention as well as meeting revenue targets
    • Our data is stored in s3 and pulled into our clusters of up to 4 m1.large EC2 instances. Our total data volume is on the order of 5Tb
    • We use Hadoop to analyze production logs and to provide various statistics on our In-Text advertising network.
    • We also use Hadoop/HBase to process user interactions with advertisements and to optimize ad selection.
    • 30 node AWS EC2 cluster (varying instance size, currently EBS-backed) managed by Chef & Poolparty running Hadoop 0.20.2+228, Pig 0.5.0+30, Azkaban 0.04, Wukong
    • Used for ETL & data analysis on terascale datasets, especially social network data (on api.infochimps.com)
    • Using a 10 node HDFS cluster to store and process retrieved data.

J


K


  • Kalooga - Kalooga is a discovery service for image galleries.
    • Uses Hadoop, Hbase, Chukwa and Pig on a 20-node cluster for crawling, analysis and events processing.
  • Katta - Katta serves large Lucene indexes in a grid environment.
  • Koubei.com - Large local community and local search in China.
    • Using Hadoop to process Apache logs, analyzing users' actions, click flow, the links clicked from any specified page on the site, and more. Also using Hadoop to process all user-entered price data with map/reduce.
    • Source code search engine uses Hadoop and Nutch.

L


    • Hardware: 10 nodes, each node has 8 core and 8GB of RAM
    • Studying verbal and non-verbal communication.
    • 44 nodes
    • Dual quad-core Xeon L5520 (Nehalem) @ 2.27GHz, 16GB RAM, 4TB/node storage.
    • Used for charts calculation, log analysis, A/B testing
    • 20 dual quad-core nodes, 32GB RAM , 5x1TB
    • Used for user profile analysis, statistical analysis,cookie level reporting tools.
    • Some Hive but mainly automated Java MapReduce jobs that process ~150MM new events/day.
  • Lineberger Comprehensive Cancer Center - Bioinformatics Group This is the cancer center at UNC Chapel Hill. We are using Hadoop/HBase for databasing and analyzing Next Generation Sequencing (NGS) data produced for the Cancer Genome Atlas (TCGA) project and other groups. This development is based on the SeqWare open source project which includes SeqWare Query Engine, a database and web service built on top of HBase that stores sequence data types. Our prototype cluster includes:
    • 8 dual quad core nodes running CentOS
    • total of 48TB of HDFS storage
    • HBase & Hadoop version 0.20
    • We have multiple grids divided up based upon purpose.
      • Hardware:
        • 120 Nehalem-based Sun x4275, with 2x4 cores, 24GB RAM, 8x1TB SATA
        • 580 Westmere-based HP SL 170x, with 2x4 cores, 24GB RAM, 6x2TB SATA
        • 1200 Westmere-based SuperMicro X8DTT-H, with 2x6 cores, 24GB RAM, 6x2TB SATA
      • Software:
        • CentOS 5.5 -> RHEL 6.1
        • Sun JDK 1.6.0_14 -> Sun JDK 1.6.0_20 -> Sun JDK 1.6.0_26
        • Apache Hadoop 0.20.2+patches -> Apache Hadoop 0.20.204+patches
        • Pig 0.9 heavily customized
        • Azkaban for scheduling
        • Hive, Avro, Kafka, and other bits and pieces...
    • We use these things for discovering People You May Know and other fun facts.
    • We use Hadoop to process clickstream and demographic data in order to create web analytic reports.
    • Our cluster runs across Amazon's EC2 webservice and makes use of the streaming module to use Python for most operations.
    • Using Hadoop and Hbase for storage, log analysis, and pattern discovery/analysis.

M


    • We use Hadoop to filter user behaviour, recommendations and trends from external sites
    • Using zkpython
    • Used EC2, now using many small machines (8GB RAM, 4 cores, 1TB)
    • 18 node cluster (Quad-Core Intel Xeon, 1TB/node storage)
    • Financial data for search and aggregation
    • Customer Relation Management data for search and aggregation
    • 20 node cluster (dual quad cores, 16GB, 6TB)
    • Used for log processing, data analysis and machine learning.
    • Focus is on social graph analysis and ad optimization.
    • Use a mix of Java, Pig and Hive.
    • 20 nodes cluster (12 * 20 cores, 32GB, 53.3TB)
    • Customers' logs on on-line apps
    • Operations log processing
    • Use java, pig, hive, oozie
    • We use Hadoop to develop MapReduce algorithms:
      • Information retrieval and analytics
      • Machine generated content - documents, text, audio, & video
      • Natural Language Processing
    • Project portfolio includes:
      • Natural Language Processing
      • Mobile Social Network Hacking
      • Web Crawlers/Page scraping
      • Text to Speech
      • Machine generated Audio & Video with remuxing
      • Automatic PDF creation & IR
  • 2 node cluster (Windows Vista/CYGWIN, & CentOS) for developing MapReduce programs.
    • 18 node cluster (Quad-Core AMD Opteron 2347, 1TB/node storage)
    • Powers data for search and aggregation
  • MetrixCloud - provides commercial support, installation, and hosting of Hadoop Clusters.

N


    • We use Hadoop/Mahout to process user interactions with advertisements to optimize ad selection.
    • Another Bigtable cloning project using Hadoop to store large structured data set.
    • 200 nodes(each node has: 2 dual core CPUs, 2TB storage, 4GB RAM)
    • Up to 1000 instances on Amazon EC2
    • Data storage in Amazon S3
    • 50 node cluster in Coloc
    • Used for crawling, processing, serving and log analysis
    • We use Hadoop to store and process our log files
    • We rely on Apache Pig for reporting and analytics, on Cascading for machine learning, and on a proprietary JavaScript API for ad-hoc queries
    • We use commodity hardware, with 8 cores and 16 GB of RAM per machine

O


  • optivo - Email marketing software
    • We use Hadoop to aggregate and analyse email campaigns and user interactions.
    • Development is based on the GitHub repository.

P


  • Papertrail - Hosted syslog and app log management
    • Hosted syslog and app log service can feed customer logs into Hadoop for their analysis (usually with Hive)
    • Most customers load gzipped TSVs from S3 (which are uploaded nightly) into Amazon Elastic MapReduce
  • PARC - Used Hadoop to analyze Wikipedia conflicts paper.
  • Performable - Web Analytics Software
    • We use Hadoop to process web clickstream, marketing, CRM, & email data in order to create multi-channel analytic reports.
    • Our cluster runs on Amazon's EC2 webservice and makes use of Python for most of our codebase.
  • Pharm2Phork Project - Agricultural Traceability
    • Using Hadoop on EC2 to process observation messages generated by RFID/Barcode readers as items move through supply chain.
    • Analysis of BPEL generated log files for monitoring and tuning of workflow processes.
  • Powerset / Microsoft - Natural Language Search
  • Pressflip - Personalized Persistent Search
    • Using Hadoop on EC2 to process documents from a continuous web crawl and distributed training of support vector machines
    • Using HDFS for large archival data storage
    • 4 nodes cluster (32 cores, 1TB).
    • We use Hadoop for searching and analysis of millions of bookkeeping postings
    • Also used as a proof of concept cluster for a cloud based ERP system
    • 2 nodes cluster (16 cores, 500GB).
    • We use Hadoop for analyzing poker players game history and generating gameplay related players statistics
    • 50 node cluster in Colo.
    • Also used as a proof of concept cluster for a cloud based ERP system.
    • Multiple alignment of protein sequences helps to determine evolutionary linkages and to predict molecular structures. The dynamic nature of the algorithm, coupled with the data and compute parallelism of Hadoop data grids, improves the accuracy and speed of sequence alignment. Parallelism at the sequence and block level reduces the time complexity of MSA problems. The scalable nature of Hadoop makes it apt for solving large-scale alignment problems.
    • Our cluster size varies from 5 to 10 nodes. Cluster nodes vary from 2950 Quad Core Rack Server, with 2x6MB Cache and 4 x 500 GB SATA Hard Drive to E7200 / E7400 processors with 4 GB RAM and 160 GB HDD.
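
The Papertrail entry above mentions nightly gzipped TSV log archives being loaded for analysis. Before handing such files to Hive or MapReduce, it can help to inspect one locally. The sketch below is only an illustration under stated assumptions: the column layout (`timestamp`, `host`, `program`, `message`) is hypothetical and is not taken from any real export format, and `count_by_host` is a made-up helper name.

```python
import csv
import gzip
import io
from collections import Counter

# Hypothetical column layout; real exports will differ.
COLUMNS = ["timestamp", "host", "program", "message"]


def count_by_host(gz_bytes):
    """Count log lines per host in a gzipped, tab-separated archive."""
    counts = Counter()
    with gzip.open(io.BytesIO(gz_bytes), mode="rt",
                   encoding="utf-8", newline="") as fh:
        reader = csv.DictReader(fh, fieldnames=COLUMNS, delimiter="\t",
                                quoting=csv.QUOTE_NONE)
        for row in reader:
            counts[row["host"]] += 1
    return counts


# Build a tiny archive in memory so the example runs without any files.
raw = (
    "2012-05-09T10:00:00\tweb1\tnginx\tGET /\n"
    "2012-05-09T10:00:01\tweb2\tnginx\tGET /about\n"
    "2012-05-09T10:00:02\tweb1\tsshd\tlogin ok\n"
)
archive = gzip.compress(raw.encode("utf-8"))
print(count_by_host(archive))
```

Reading the archive as a stream, row by row, keeps memory use flat even when a nightly file is large.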

Q


    • 3000 cores, 3500TB. 1PB+ processing each day.
    • Hadoop scheduler with fully custom data path / sorter
    • Significant contributions to KFS filesystem

R


    • 30 node cluster (Dual-Core, 4-8GB RAM, 1.5TB/node storage)
    • Parses and indexes logs from email hosting system for search: http://blog.racklabs.com/?p=66
  • Rakuten - Japan's online shopping mall
    • 69 node cluster
    • We use Hadoop to analyze logs and mine data for recommender system and so on.
    • 80 node cluster (each node has: 2 quad core CPUs, 4TB storage, 16GB RAM)
    • We use Hadoop to process data relating to people on the web
    • We are also involved with Cascading to help simplify how our data flows through various processing stages
    • Hardware: 50 nodes (2*4cpu 2TB*4 disk 16GB RAM each)
    • We use Hadoop(Hive) to analyze logs and mine data for recommendation.
    • We use Hadoop for our internal search
    • Hardware: 35 nodes (2*4cpu 10TB disk 16GB RAM each)
    • We intend to parallelize some traditional classification and clustering algorithms, such as Naive Bayes, K-Means and EM, so that they can deal with large-scale data sets.
    • Hardware: 5 nodes
    • We use Hadoop to process user resume data and run algorithms for our recommendation engine.
  • RightNow Technologies - Powering Great Experiences
    • 16 node cluster (each node has: 2 quad core CPUs, 6TB storage, 24GB RAM)
    • We use Hadoop for log and usage analysis
    • We predominantly leverage Hive and HUE for data access

S


    • SARA has initiated a Proof-of-Concept project to evaluate the Hadoop software stack for scientific use.
    • A project to help develop open source social search tools. We run a 125 node Hadoop cluster.
  • SEDNS - Security Enhanced DNS Group
    • We are gathering world wide DNS data in order to discover content distribution networks and configuration issues utilizing Hadoop DFS and MapRed.
    • 18 node cluster (each node has: 4 dual core CPUs, 1TB storage, 4GB RAM, RedHat OS)
    • We use Hadoop for our high speed data mining applications
    • We have a core Analytics group that is using a 10-Node cluster running RedHat OS
    • Hadoop is used as an infrastructure to run MapReduce (MR) algorithms on a number of raw data sets
    • Raw data ingest happens hourly. Raw data comes from hardware and software systems out in the field
    • Ingested and processed data is stored into a relational DB and rolled up using Hive/Pig
    • Plan to implement Mahout to build a recommendation engine
    • 14 node cluster (each node has: 2 dual core CPUs, 2TB storage, 8GB RAM)
    • We use Hadoop to process log data and perform on-demand analytics
    • We are developing the MrGeo (Map/Reduce Geospatial) application to allow our users to bring cloud computing to geospatial processing.
    • We use HDFS and MapReduce to store, process, and index geospatial imagery and vector data.
    • MrGeo is soon to be open sourced as well.
    • Hosted Hadoop data warehouse solution provider
    • We use HBase to store our recommendation information and to run other operations. We have HBase committers on staff.

T


  • Taragana - Web 2.0 Product development and outsourcing services
    • We are using 16 consumer grade computers to create the cluster, connected by 100 Mbps network.
    • Used for testing ideas for blog and other data mining.
  • The Lydia News Analysis Project - Stony Brook University
    • We are using Hadoop on 17-node and 103-node clusters of dual-core nodes to process and extract statistics from over 1000 U.S. daily newspapers as well as historical archives of the New York Times and other sources.
  • Tailsweep - Ad network for blogs and social media
    • 8 node cluster (Xeon Quad Core 2.4GHz, 8GB RAM, 500GB/node Raid 1 storage)
    • Used as a proof of concept cluster
    • Handling e.g. data mining and blog crawling
    • Generating stock analysis on 23 nodes (dual 2.4GHz Xeon, 2 GB RAM, 36GB Hard Drive)
    • Collection and analysis of Log, Threat, Risk Data and other Security Information on 32 nodes (8-Core Opteron 6128 CPU, 32 GB RAM, 12 TB Storage per node)
    • We use Hadoop in our data mining and user modeling, multimedia, and internet research groups.
    • 6 node cluster with 96 total cores, 8GB RAM and 2 TB storage per machine.
    • 60-Node cluster for our Location-Based Content Processing including machine learning algorithms for Statistical Categorization, Deduping, Aggregation & Curation (Hardware: 2.5 GHz Quad-core Xeon, 4GB RAM, 13TB HDFS storage).
    • Private cloud for rapid server-farm setup for stage and test environments (Using Elastic N-Node cluster)
    • Public cloud for exploratory projects that require rapid servers for scalability and computing surges (Using Elastic N-Node cluster)
    • We use Hadoop for log analysis.
    • We use Hadoop HDFS, Map/Reduce, Hive and HBase
    • We manage over 300 TB of HDFS data across four Amazon EC2 Availability Zones
    • We use Hadoop for searching and indexing
    • We use Hadoop to store and process tweets, log files, and many other types of data generated across Twitter. We use Cloudera's CDH2 distribution of Hadoop, and store all data as compressed LZO files.
    • We use both Scala and Java to access Hadoop's MapReduce APIs
    • We use Pig heavily for both scheduled and ad-hoc jobs, due to its ability to accomplish a lot with few statements.
    • We employ committers on Pig, Avro, Hive, and Cassandra, and contribute much of our internal Hadoop work to open source (see hadoop-lzo)
    • For more on our use of Hadoop, see the following presentations: Hadoop and Pig at Twitter and Protocol Buffers and Hadoop at Twitter
    • We use Hadoop to assemble web publishers' summaries of what users are copying from their websites, and to analyze user engagement on the web.
    • We use Pig and custom Java map-reduce code, as well as chukwa.
    • We have 94 nodes (752 cores) in our clusters, as of July 2010, but the number grows regularly.

U


    • 5 node low-profile cluster. We use Hadoop to support the research project: Territorial Intelligence System of Bogota City.
    • 30 nodes cluster (Xeon Quad Core 2.4GHz, 4GB RAM, 1TB/node storage). We use Hadoop to facilitate information retrieval research & experimentation, particularly for TREC, using the Terrier IR platform. The open source release of Terrier includes large-scale distributed indexing using Hadoop Map Reduce.
    • We are one of six universities participating in IBM/Google's academic cloud computing initiative. Ongoing research and teaching efforts include projects in machine translation, language modeling, bioinformatics, email analysis, and image processing.
    • We currently run one medium-sized Hadoop cluster (1.6PB) to store and serve up physics data for the computing portion of the Compact Muon Solenoid (CMS) experiment. This requires a filesystem which can download data at multiple Gbps and process data at an even higher rate locally. Additionally, several of our students are involved in research projects on Hadoop.
    • 10 nodes cluster (Dell PowerEdge R200 with Xeon Dual Core 3.16GHz, 4GB RAM, 3TB/node storage).
    • Our goal is to develop techniques for the Semantic Web that take advantage of MapReduce (Hadoop) and its scaling-behavior to keep up with the growing proliferation of semantic data.
    • RDFPath is an expressive RDF path language for querying large RDF graphs with MapReduce.
    • PigSPARQL is a translation from SPARQL to Pig Latin that allows SPARQL queries to be executed on large RDF graphs with MapReduce.

V


    • We use a small Hadoop cluster to reduce usage data for internal metrics, for search indexing and for recommendation data.
    • We use a Hadoop cluster for search and indexing for our projects.
  • Visible Measures Corporation uses Hadoop as a component in our Scalable Data Pipeline, which ultimately powers VisibleSuite and other products. We use Hadoop to aggregate, store, and analyze data related to in-stream viewing behavior of Internet video audiences. Our current grid contains more than 128 CPU cores and in excess of 100 terabytes of storage, and we plan to grow that substantially during 2008.
    • We use a small Hadoop cluster in the scope of our general research activities at VK Labs to get faster data access from web applications.
    • We also use Hadoop for filtering and indexing listings, for log analysis, and for recommendation data.

W


    • We use Hadoop for our internal search engine optimization (SEO) tools. It allows us to store, index, and search data much faster.
    • We also use it for log analysis and trend prediction.
    • Hardware: 44 servers (each server has: 2 dual core CPUs, 2TB storage, 8GB RAM)
    • Each server runs Xen with one Hadoop/HBase instance and another instance with web or application servers, giving us 88 usable virtual machines.
    • We run two separate Hadoop/HBase clusters with 22 nodes each.
    • Hadoop is primarily used to run HBase and Map/Reduce jobs scanning over the HBase tables to perform specific tasks.
    • HBase is used as a scalable and fast storage back end for millions of documents.
    • Currently we store 12 million documents with a target of 450 million in the near future.

X


Y


    • More than 100,000 CPUs in >40,000 computers running Hadoop
    • Our biggest cluster: 4500 nodes (2*4cpu boxes w 4*1TB disk & 16GB RAM)
      • Used to support research for Ad Systems and Web Search
      • Also used to do scaling tests to support development of Hadoop on larger clusters
    • Our Blog - Learn more about how we use Hadoop.
    • >60% of Hadoop Jobs within Yahoo are Pig jobs.

Z


  • 10 node cluster (Dual-Core AMD Opteron 2210, 4GB RAM, 1TB/node storage)
  • Run Naive Bayes classifiers in parallel over crawl data to discover event information
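
The entry above runs Naive Bayes classifiers in parallel over crawl data. Training Naive Bayes reduces largely to counting, which is why it fits MapReduce well: mappers emit (label, word) pairs and reducers sum them. What follows is a minimal single-process sketch of that idea with add-one smoothing. The function names, labels, and toy documents are invented for illustration and do not come from any listed company.

```python
import math
from collections import Counter, defaultdict


def map_doc(label, text):
    """Map step: emit one ((label, word), 1) pair per token."""
    for word in text.lower().split():
        yield (label, word), 1


def reduce_counts(pairs):
    """Reduce step: sum the counts for each (label, word) key."""
    totals = Counter()
    for key, n in pairs:
        totals[key] += n
    return totals


def train(docs):
    """Build per-label word counts and per-label document counts."""
    pairs = [kv for label, text in docs for kv in map_doc(label, text)]
    word_counts = defaultdict(Counter)
    for (label, word), n in reduce_counts(pairs).items():
        word_counts[label][word] = n
    doc_counts = Counter(label for label, _ in docs)
    return word_counts, doc_counts


def classify(text, word_counts, doc_counts):
    """Return the label with the highest log-posterior (add-one smoothing)."""
    vocab = {w for counts in word_counts.values() for w in counts}
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label, counts in word_counts.items():
        score = math.log(doc_counts[label] / total_docs)
        denom = sum(counts.values()) + len(vocab)
        for word in text.lower().split():
            score += math.log((counts[word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy training data, purely hypothetical.
docs = [
    ("event", "concert tonight at the city hall"),
    ("event", "festival tickets on sale"),
    ("other", "privacy policy and terms of service"),
    ("other", "contact us about your account"),
]
model = train(docs)
print(classify("jazz concert tickets", *model))
```

On a real cluster only `map_doc` and `reduce_counts` would run as distributed steps; classification can then be applied per document in a map-only job using the aggregated counts.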
