Background: Even with top-notch architecture, coding, and Quality Assurance (QA), it’s easy to make simple mistakes that, once introduced into production, can be quite costly to fix.
Production issues can cause downtime, brand and reputation damage, loss of customer and end-user confidence, lost productivity, lost revenue, and wasted resources.
This article describes some of the most common mistakes made on the way from development into production.
1. Mistake #1: “It worked on my machine.”
Make sure dependent components are identified and bundled with your installation package.
You hit “compile”, you get a couple of warnings, no errors, and… done! You run the app, click around, everything seems to work OK, so you push the new version into production.
Within minutes, customers or end-users start calling you because they are getting an error message about a missing component.
First, let me state up front: most development shops have a QA function that’s supposed to catch missing dependencies, and full regression testing should always be part of the release management process. But there are also many self-maintained or single-sourced software projects out there that start off as hobby code, or as garage projects that enjoy limited commercial release.
That said, development environments are often specially crafted and highly personalized to facilitate the development and coding process. They often have complete installs of development tools, such as Microsoft Visual Studio, that come with libraries and other components that normal end users won’t have installed on their PCs. Sometimes there are 3rd-party tools and utilities that you might be using without even realizing it.
The best approach is to make sure you have a “clean build” environment where you test new builds. Be religious about clean-testing new builds, and keep a careful list of platform and component dependencies.
Virtualization works well for “clean build” environments – tools such as Oracle VirtualBox and Microsoft Hyper-V are free to use under specific circumstances, and support disk snapshots, so that at the end of a test you can effectively “snap back”, discarding any changes that were made.
The installation process should be single-click – the installer should bundle redistributable versions of every dependent package or platform.
There’s nothing more frustrating than trying to install some software that requires that you download half the internet to get the thing working!
Bundling dependencies ensures that your customers and end-users get exactly what components they need, and the correct versions that your software requires.
When people are waiting on you, “it worked on my machine” is not an excuse! Clean-build testing and bundled dependencies help make sure no excuse is needed.
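A clean-build check can be partly automated. Here is a minimal sketch of a pre-install "preflight" script that verifies a dependency manifest on a fresh machine; the module and tool names are illustrative placeholders, not real requirements of any particular product:

```python
import importlib.util
import shutil

# Hypothetical dependency manifest: the Python packages and external
# tools your installer would need to bundle or verify. Illustrative only.
REQUIRED_MODULES = ["json", "sqlite3"]   # stdlib here, but could be 3rd-party
REQUIRED_TOOLS = []                      # e.g. ["ffmpeg"] if your app shells out

def missing_dependencies(modules=REQUIRED_MODULES, tools=REQUIRED_TOOLS):
    """Return a list of (kind, name) pairs for anything not present."""
    missing = []
    for name in modules:
        # find_spec returns None when the module cannot be imported
        if importlib.util.find_spec(name) is None:
            missing.append(("module", name))
    for name in tools:
        # shutil.which returns None when the executable is not on PATH
        if shutil.which(name) is None:
            missing.append(("tool", name))
    return missing
```

Running this on the clean-build VM before every release gives you the dependency list that the installer must bundle.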
2. Mistake #2: “If I can use it for free, then it’s free”
Make sure you and your customers / end-users comply with 3rd-party Licensing.
Ironically, developers, who ostensibly make a living writing code, are often the worst at understanding and respecting the work of other developers.
This mistake comes in two flavors: tools and components that are included with the development environment, and 3rd-party open-source / shareware / tools.
Often, development environments come with “full install” versions of software components that other people have to license independently. One example is Microsoft SQL Server – there is a free, feature-limited edition called SQL Server Express, but building against the full version means the customer or end-user is responsible for a full SQL Server license!
In some cases, if you run a hosted environment, you might require a special license. For example, Microsoft requires a Services Provider License Agreement (SPLA) when you host Microsoft software as a service for third parties.
Likewise, “open source” tools and components might be free to use under certain conditions, but might require special licensing if you sell your software commercially.
Other restrictions typical of GPL, LGPL, and similar licensing might include:
- Enterprise licensing: Although free for personal use, some components / tools might require fee-based licensing, or use might be prohibited inside a company.
- Reseller licensing: Even though a component is “open source”, there might be a fee for reselling someone else’s code, or using their code in your project might be prohibited.
- Commercial use: Some tools and components are free to use personally, but might require special licensing if you use them to develop software for resale. Often, this is called the “cobbler test”: If your company makes shoes (for internal use), you’re OK. If you SELL shoes (externally), you need a license.
Failure to comply with licensing requirements could result in exorbitant license fees, fines, a lawsuit, or even jail time!
To make sure neither you nor your clients are in for an unexpected surprise, ALWAYS read license agreements – know which components are truly “free” to use, and which might require special licensing for you or your clients. Enter into license agreements where appropriate, and work with your key vendors to establish strategic agreements that provide reduced costs (and therefore better pricing) as well as enhanced usage rights for you and your customers.
The purpose of licensing is to ensure that the developer’s rights are protected — if you use someone else’s tools or components, you have to abide by the terms of their license.
3. Mistake #3: Fire and Forget
Make sure every transaction can fail safely.
Whether your application writes something to disk, or sends data over a network, make sure that you account for the possibility that things could break.
The development environment is often an ideal world, with lots of LAN-connected components, as well as lots of bandwidth and computing resources. The real world often runs on old, slow hardware, and flaky, slow networks.
- Identify atomic transactions. If components A, B, and C are all part of one transaction, have the code ensure that A, B, and C all get committed together, or they all fail together. Nothing is worse than trying to deconstruct a partial transaction from its components. Think of it this way: If you work at a hospital, a patient could get dosed twice, or if you work at a bank, a check could get posted twice. You don’t want YOUR CODE to be the reason someone dies or goes broke. If your application copies a file AND updates a database, make sure BOTH happen or NEITHER happen.
- Anything outside your code might fail. If you call a third-party component, write a file to disk, transmit data on a network, or write to a database, anticipate an error condition. Make sure the application recovers smoothly – alert the administrators, give the end-users a friendly message, and preserve the data if possible.
- Use WAN testing tools – several are available that simulate faulty or flaky networks. If your program runs well under difficult conditions, an internet glitch won’t cause your application to crash or lose data.
- Make sure your application can write to a backup (failover) component. For example, if you send data to a database server, or look up names in DNS, make sure your application switches over to a pre-configured backup database or DNS server automatically!
- Disk failures happen! Network storage, SAN, as well as local disks can fail, often at the worst possible time. Make sure your application allows for writing to a backup location in the event that primary storage fails.
- Timeout. Every transaction should have an absolute timeout, after which the server, broker, remote system, and client all agree the transaction has NOT been committed. The timeout should be shorter than the user session… the user should receive a clear error message that the transaction failed, rather than an ambiguous timeout message.
- Clean crash recovery. Whether your application crashes, the operating system crashes, or the administrator terminates your program incorrectly, always start your application with a post-execution cleanup process. The classic example is a Java process that leaves behind a PID file and then can’t start! On startup, make sure your app cleans up everything from the previous execution, including temp files (and PID files, if applicable), re-registers cleanly with remote servers, terminates any zombies (unattached processes from the previous execution), and frees any local or remote resources. If a transaction has multiple states, keep track of transaction state, to ensure that transactions don’t get duplicated or ignored.
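The file-copy-plus-database-update case from the first bullet can be sketched in a few lines. This is a minimal illustration using SQLite and a temp-file rename, not a full two-phase commit; the table and file names are hypothetical, and a production version would also need to handle the narrow window where the rename succeeds but the commit fails:

```python
import os
import shutil
import sqlite3
import tempfile

def copy_and_record(src, dst, db_path):
    """Copy src to dst AND record it in the database: both succeed or
    neither does. The copy goes to a temp name first, so a failure
    leaves no partial file behind."""
    tmp = dst + ".part"
    shutil.copyfile(src, tmp)
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # opens a transaction; rolls back on any exception
            conn.execute("CREATE TABLE IF NOT EXISTS files (path TEXT)")
            conn.execute("INSERT INTO files VALUES (?)", (dst,))
            os.replace(tmp, dst)  # atomic rename on the same filesystem
    except Exception:
        if os.path.exists(tmp):
            os.remove(tmp)  # clean up the partial copy, nothing was committed
        raise
    finally:
        conn.close()

# Tiny demo under a temp directory (paths are illustrative).
demo_dir = tempfile.mkdtemp()
src_path = os.path.join(demo_dir, "report.txt")
with open(src_path, "w") as f:
    f.write("hello")
dst_path = os.path.join(demo_dir, "archive", "report.txt")
os.makedirs(os.path.dirname(dst_path))
db_file = os.path.join(demo_dir, "files.db")
copy_and_record(src_path, dst_path, db_file)
recorded = sqlite3.connect(db_file).execute("SELECT path FROM files").fetchall()
```

If the INSERT or the rename throws, the database rolls back and the partial file is deleted, so the system never sees “file without record” or “record without file”.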
A fast, easy restart means less down time, and data integrity is the most important thing to your customers and end-users.
(In my best Yoda voice) Always plan to fail, and you will always succeed.
4. Mistake #4: Scaling Issues
Scale-out means running multiple instances, while scale-up means running a bigger instance. Scaling an application for a production environment can mean both!
4.1. Scale-out presents several unique issues:
- Inter-process communication. If you have multiple instances of a process, they all need to communicate with each other (Inter-Process Communication, or IPC) so that they know which instance is handling what work. I’ve seen many applications where the task server is the bottleneck, because you can’t run multiple task servers!
- Session awareness. Session awareness means that an app or web server can fail, and the user’s session persists, failing over seamlessly to another instance. Saving the user’s session means saving frustration.
- Infrastructure capacity. Core infrastructure components, such as file and database servers, can be overrun by scale-out: every new instance adds concurrent connections, which means slower transaction times, decreased reliability, a greater memory and resource footprint, longer disk queues, and more concurrency problems for your application when infrastructure transactions fail to complete in a timely manner (imagine “add to cart” taking 10 minutes on Amazon.com – I guarantee you’ll shop somewhere else!). The best approach is to plan from day 1 to use multiple databases, file servers, and other core resources. Using pointers to infrastructure resources (pointer databases, UNC paths to file servers) allows core resources to scale quickly and easily with application growth.
- Application Delivery. Formerly referred to as load-balancing, understand and know how to leverage application delivery. App delivery can route traffic to data centers with excess capacity, or route around failure. The closer your application integrates with the app delivery tier, the more reliable and persistent your application service will appear.
- N-tier versus x-tier. N-tier describes an application split into logical tiers, which in development often all run on one box. In reality, most production environments split the various tiers out to separate systems (x-tier). Each tier needs to be able to communicate with the next tier – possibly through an App Delivery layer, or perhaps the application has its own method for resource allocation.
- N-squared. This is my favorite scaling problem. Assume two application tiers, “A”, and “B”. If every “A” node must maintain a connection to every “B” node (called “full mesh”), then the connections have an “n-squared” relationship — for the number of nodes, the number of connections approaches n^2 (n squared). Brokers ensure efficient communication between app tiers. App delivery / load balancing can be used for brokering connections between tiers, or the app might have its own load-balance algorithm.
- Selection bias. This is my second favorite scaling problem. If you have “n” nodes, selection bias means you always start with node 1, then move to node 2. This means that by “selecting” node 1 first (etc….), node 1 gets 30+% of the traffic, while node “n” never gets used! If you have a load-balance algorithm or leverage an app-delivery tier, make sure you avoid selection bias. Maintain state external to the session to ensure that new sessions start where the old one left off, or use random selection for the initial node.
- N-squared and selection bias go hand-in-hand. Beware.
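The n-squared and selection-bias points above are easy to see with a little arithmetic. This sketch counts connections for full mesh versus a broker, and shows the random initial-node pick that avoids selection bias (the numbers are illustrative, not from any real deployment):

```python
import random

def full_mesh_connections(a_nodes, b_nodes):
    """Every 'A' node connects to every 'B' node: a * b connections,
    which approaches n^2 as both tiers grow."""
    return a_nodes * b_nodes

def brokered_connections(a_nodes, b_nodes):
    """With a broker between the tiers, each node holds a single
    connection to the broker: a + b connections."""
    return a_nodes + b_nodes

def pick_start_node(n_nodes, rng=random):
    """Avoid selection bias: start at a random node rather than
    always handing the first session to node 1."""
    return rng.randrange(n_nodes)
```

With 20 nodes per tier, full mesh needs 400 connections while a broker needs only 40; the gap widens quadratically from there.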
4.2. Scaling up has its own set of challenges:
- Memory footprint. Runtimes such as Java and .NET are subject to underlying OS limitations – for example, a 32-bit process on Windows is limited to 2 GB of address space by default. If your server has 20 GB, and your application can only use 2, you need to re-think your strategy! Run multiple instances on the same machine, or run a different build, such as 64-bit instead of 32-bit. Make sure you use system resources efficiently.
- Thread allocation. Using 2 CPUs efficiently is quite a bit different than using 32 CPUs efficiently! Modularized, multi-threaded code ensures that as the application scales up, system resources can be used efficiently.
- Paging. Virtual memory means that your application might be using virtual resources. Keep track of timing, and recommend that administrators increase physical memory to prevent paging out to disk. A disk access is in the 10 ms range, while a memory access is in the 10 nanosecond range! Avoid paging; it is death to your application.
- Storage and Network IO. As your application scales, having fewer Input/Output (IO) paths means that disk and network writes could take longer. “Queue length” is an indication of IO taking too long… the longer the queue, the more IO is waiting “in queue” to be processed by the appropriate subsystem. High kernel usage can also be an indication of slow IO, as most IO is handled by the kernel. Plan in advance for slow IO. Monitor, and send an administrative alert recommending increased IO capacity.
Planning from day 1 to accommodate multiple instances and large instances ensures that your application will run smoothly in a large-scale production environment.
5. Mistake #5: Compliance Issues
Production environments often have compliance requirements based on the type of data they store, transmit, or manage.
Here are some examples of sector-specific regulatory requirements:
- PCI DSS – the Payment Card Industry Data Security Standard. If your application accepts, stores, or transmits credit card data, it’s subject to PCI DSS.
- GLBA – If your application runs inside a bank environment, or processes online transactions for banks, it’s subject to the Gramm-Leach-Bliley Act.
- HIPAA – If your application runs in a doctor’s office, hospital, insurance company, or other medical environment, it’s subject to the Health Insurance Portability and Accountability Act.
- OWASP top 10 – Web-based applications should identify and actively avoid the OWASP top 10 list of vulnerabilities.
All of these requirements have unique privacy and security standards. Writing a sector-specific application means conforming to sector-specific requirements – educate yourself about the requirements and how to comply with them.
General guidelines for secure coding:
- All applications should comply with secure coding practices.
- All applications should assume that every transaction is monitored, and an attacker might try to compromise them.
- Passwords should be hashed, not stored in cleartext, nor encrypted. A hashed password can’t be extracted and used elsewhere.
- Unchecked buffers are a potential memory exploit.
- Have input validation rules defined for all input, and validate all input.
- Authenticate every transaction. Even in the absence of encryption, secure hashing and other forms of authentication can be used to ensure that transactions and transaction data are legitimate.
Use encryption where feasible.
- Communication between tiers should use Secure Sockets Layer (SSL) or Transport Layer Security (TLS) where possible.
- SSL and TLS authenticate the servers to each other, and encrypt data transmitted between them.
- Data stored on a file system should be encrypted using the operating system’s native libraries.
- Databases can often be configured for native encryption, so that only certain users have access to certain fields. Any sensitive field should be encrypted.
Beware query by form, and never pass raw SQL from the web tier to the core.
- Query by form is the Achilles heel of any application. If you throw up a form with a date and a customer ID, HOW could that possibly be exploited? What if the user hacks the URL, appending “OR 1=1”? If you pass raw SQL to the database, the query will return all rows.
- Tables and other objects should be aliased to prevent exploitation.
- Don’t pass whole or partial SQL — no WHERE, HAVING, GROUP BY, or ORDER BY clauses.
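The “OR 1=1” bullet above is exactly what parameterized queries prevent. This sketch uses an in-memory SQLite table (the schema and data are made up for illustration) to show that a bound parameter is treated as data, never as SQL:

```python
import sqlite3

# Hypothetical transactional table, populated for the demo.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer_id TEXT, total REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [("alice", 10.0), ("bob", 20.0)])

def orders_for(customer_id):
    """Parameterized query: the driver binds customer_id as a value,
    so injected SQL fragments match nothing instead of everything."""
    return conn.execute(
        "SELECT customer_id, total FROM orders WHERE customer_id = ?",
        (customer_id,),
    ).fetchall()

safe = orders_for("alice")
# The classic injection attempt is just a weird customer ID here:
injected = orders_for("alice' OR '1'='1")
```

Had the input been concatenated into a raw SQL string instead, the second call would have returned every row in the table.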
Understand, respect, and conform to infrastructure security controls:
- Firewall: Applications should use specifically-defined ports and well-known endpoints between tiers. Port ranges are difficult to define, frowned upon, and leave room for exploitation.
- Intrusion Detection / Prevention: Passing raw SQL, or using well-known application ports (such as TCP/1433) might trigger intrusion prevention, causing your application not to run properly.
- Antivirus: Scanning certain file types, such as zip files, can cause excessive overhead, resulting in corrupt files, failed transactions, and unreliable applications. Use standard file formats that can be easily excluded, and make administrators aware of application-related file formats and requirements. Scanning large files (such as database files) should ALWAYS be excluded.
- XML Gateway / Datawall: This type of device is configured to allow only certain types of transactions in to or out of your network, and to limit the amount of data returned in a single transaction.
By defaulting to a secure posture, you help ensure maximum protection for your customers and users, while avoiding potentially costly compliance pitfalls.
6. Mistake #6: Platform Bloat
There’s nothing like having to install 100 gig of platform files for 200 lines of code.
- Be aware of your core platform. .NET and Java are both guilty of this – the promise that “managed code” is fast and efficient. If your Grandma, on her Pentium M laptop, had to install a current Java or .NET platform base just to run your app, she might disagree on both counts! From experience, I can tell you there’s nothing I hate more – this is the $40-slice-of-pizza scenario – trying to install an app or utility that gleefully announces, “.NET Framework 4.5 is required!” (in mock triumph).
- Either stick to the most common versions, or stay version-independent. The only thing worse than finding out I have to install Java or .NET is finding out the three versions I already have installed aren’t sufficient. Find out what your user base primarily has installed, and conform to the majority.
- Evaluate alternatives. For lightweight uses, investigate purpose-specific options that may be smaller or more efficient. C compiles to a native executable directly, and Python, Perl, and BASIC programs can be packaged into self-contained executables using various 3rd-party tools and utilities. Some installer toolkits can even serve as executable batch files.
- Keep your code clean, lean, and mean. Don’t include libraries, options, or utilities you don’t need. These translate into dependencies you don’t need, and bloat your users don’t want.
- I worked with a guy one time who downloaded and used a 3rd-party grid control, because he liked the way it looked over the MS Common Controls grid control. The difference? Every user had to individually download and accept the license for this 3rd party control, instead of using the Microsoft-supplied one that they already had on their computer!
- In another situation, a developer used a different 3rd-party grid control, because he could implement it with less effort, only to find out that the grid control was uploading all the grid data to a 3rd-party website.
- In a third situation, a VB developer I worked with had his default project set up to bind to all of the ActiveX controls that shipped with VB. Even though his code didn’t use them all, every one of these controls had to be included as part of the installation – one missing library would cause the program to fail to execute.
Keep things lean, simple, and small, and your users and administrators will love you for it.
7. Mistake #7: Insufficient Error Handling, Logging, and Diagnostics
The biggest problem you’ll have as a developer is having to remotely support a customer or end-user with insufficient metrics and diagnostics.
7.1. Error handling ensures that your application is robust, and can survive unexpected input, data, network conditions, and environmental conditions.
- Every application and non-application call should anticipate an error. Build error handling into every function call.
- Channel-specific calls (such as database, file storage, etc…) should be handled by a channel-specific handler. There is NOTHING more frustrating than a generic error message.
- Redundancy should be applied where feasible. Every connectivity-related configuration item should include a standby or failover configuration, that the application automatically tries to leverage in the event of a primary failure.
- Every failed transaction should be survivable. Set user expectations, send an administrative alert, then DEAL with it!
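The redundancy and “deal with it” bullets above can be combined into one small pattern: try the primary endpoint, alert, retry, then fall over to the standby. This is a sketch with made-up endpoint names and a fake channel call standing in for a real database or network write:

```python
import time

def call_with_failover(endpoints, call, retries=2, delay=0.0):
    """Try each configured endpoint in order. On failure, alert the
    administrators (a print stands in for a real alert channel), retry,
    then move on to the standby. `call(endpoint)` is whatever
    channel-specific operation you need: DB write, file write, send."""
    last_error = None
    for endpoint in endpoints:
        for attempt in range(retries):
            try:
                return call(endpoint)
            except Exception as exc:
                last_error = exc
                print(f"alert: {endpoint} failed (attempt {attempt + 1}): {exc}")
                time.sleep(delay)
    # Every endpoint failed: surface one clear, channel-specific error.
    raise RuntimeError("all endpoints failed") from last_error

# Demo: the hypothetical primary always fails, the standby succeeds.
def _fake_call(endpoint):
    if endpoint == "primary-db":
        raise ConnectionError("primary down")
    return f"written via {endpoint}"

result = call_with_failover(["primary-db", "standby-db"], _fake_call)
```

The caller gets either a successful result from whichever endpoint worked, or a single survivable failure it can report to the user.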
7.2. Logging should be configurable, but robust.
- Non-repudiation. Every transaction should be logged (at all logging levels) and authenticated via secure hash. With atomic transactions, there should never be a question of whether a transaction occurred or not.
- Anything returned to the client should be logged. Error codes, return codes, status codes. If the client receives an unexpected result, this can be traced to both the return code, and the internal diagnostics.
- Debug logging should be available, detailing the entire stack (all function calls), but not enabled by default.
- All log entries should be time stamped using a coordinated time source. Typically, “network time” is obtained via the operating system through NTP (Network Time Protocol).
- Logging to external sources should be supported. This includes SYSLOG and other logging protocols, in support of centralized logging and event correlation.
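The non-repudiation bullet above amounts to timestamping each entry and authenticating it with a secure hash. Here is a minimal sketch using an HMAC over the timestamped entry; the signing key and messages are placeholders, and a real system would keep the key out of the codebase:

```python
import hashlib
import hmac
import json
import time

LOG_KEY = b"demo-signing-key"  # assumption: in production, a protected secret

def make_log_entry(message, key=LOG_KEY, ts=None):
    """Timestamped, HMAC-signed log entry. The signature proves later
    that the entry was not altered after it was written."""
    entry = {"ts": ts if ts is not None else time.time(), "msg": message}
    body = json.dumps(entry, sort_keys=True).encode()
    entry["sig"] = hmac.new(key, body, hashlib.sha256).hexdigest()
    return entry

def verify_log_entry(entry, key=LOG_KEY):
    """Recompute the HMAC over ts+msg and compare in constant time."""
    body = json.dumps({"ts": entry["ts"], "msg": entry["msg"]},
                      sort_keys=True).encode()
    expected = hmac.new(key, body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(entry["sig"], expected)

entry = make_log_entry("transaction 42 committed", ts=1700000000.0)
# Tampering with the message invalidates the signature:
tampered = dict(entry, msg="transaction 42 rolled back")
```

Paired with atomic transactions, a signed log like this leaves no question of whether a transaction occurred.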
7.3. Diagnostics should be built into the platform or infrastructure, and should provide meaningful metrics:
- Is a certain bandwidth required? Include a bandwidth test in your application. Send a fixed amount of data, then have a client script send it back. Time both transactions. Data-in-bytes * 8 / seconds = available bandwidth. One of the biggest mistakes I see is that tech support sends the end user to a 3rd-party bandwidth test site. Just because the user has a decent bandwidth test result to a 3rd-party site, doesn’t mean the instantaneous bandwidth available to your application is sufficient — the only way to be sure is to host a lightweight bandwidth test app on YOUR website.
- Component dependencies? Run a local diagnostic. If your end-user has already accepted and installed an ActiveX or Java control, this is an excellent opportunity to ensure that system pre-reqs are met, or upload logging information to the server if not.
- Time every transaction. Building a histogram by day and time of day means that you have a library of “normal” for every transaction. Abnormal transactions should generate an administrative warning, a message to the end user, and some kind of affirmative transaction disposition (so that the user knows what happened).
- Be sure to log normal and abnormal timing metrics. This allows an administrator to check whether things are working correctly. Metrics / statistics should be gathered from server AND client for every major application function.
- Automatically upload workstation diagnostic logs. If the workstation had an error, or timed out for some reason, or perhaps received an unexpected status code, upload every detail possible.
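The bandwidth formula in the first bullet (bytes * 8 / seconds) is worth writing down once, so the whole support team computes it the same way. A small sketch, with the measurement mechanics (send, echo, time) left to the hosting application:

```python
def available_bandwidth_mbps(bytes_echoed, seconds):
    """The article's formula: bytes * 8 / seconds gives bits per second;
    divide by 1,000,000 for megabits. `bytes_echoed` is the payload you
    sent and had the client echo back, `seconds` the measured time for
    that leg."""
    if seconds <= 0:
        raise ValueError("elapsed time must be positive")
    return bytes_echoed * 8 / seconds / 1_000_000
```

So a 1 MB echo completing in one second indicates roughly 8 Mbps available between your server and that user, measured on YOUR path rather than a 3rd-party test site’s.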
7.4. A simple test harness can be co-developed with the application for functional and load testing, as well as remote monitoring.
- A test harness allows you to simulate various types of transactions.
- For web-based applications, the test tool can use a browser object instance to programmatically enter data, manipulate controls, and simulate XML or web calls.
- For thick-client applications, a debugger “side-pipe” interface allows simulated events to be configured and fired from a test file, or sent via a network interface.
- Load testing means simulating many concurrent transactions against the app server. A scriptable test harness and debugging interface make this easy, with no 3rd-party tools required!
- A debugging interface can be used for remote monitoring. Simulated transactions can be timed by periodically firing off specific scripts from the test harness.
- Always protect debugging interfaces! Any connection should be authenticated, and there should be an option to disable all debugging in production.
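The bare bones of such a harness fit in a page: fire a scripted transaction from several threads, time each call, and report timing statistics. This sketch times a stand-in function; in a real harness, `transaction` would be the browser-object or side-pipe call described above:

```python
import statistics
import threading
import time

def load_test(transaction, concurrency=8, rounds=5):
    """Run `transaction()` from `concurrency` threads, `rounds` times
    each, and collect per-call timings for the monitoring report."""
    timings = []
    lock = threading.Lock()

    def worker():
        for _ in range(rounds):
            start = time.perf_counter()
            transaction()  # the scripted, simulated transaction
            elapsed = time.perf_counter() - start
            with lock:
                timings.append(elapsed)

    threads = [threading.Thread(target=worker) for _ in range(concurrency)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return {"count": len(timings),
            "mean": statistics.mean(timings),
            "max": max(timings)}

# Demo with a trivial stand-in transaction.
stats = load_test(lambda: time.sleep(0.001), concurrency=4, rounds=3)
```

Running the same scripts periodically against production, through an authenticated debugging interface, doubles as the remote-monitoring probe.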
8. Mistake #8: No vendor support
Anticipate that you will need vendor support. Your customers / end-users will call you, and if you require platform support, you will need to call your vendors.
- Make sure that platform and components are kept current. Vendors don’t provide support for End Of Life (EOL) components.
- Make sure there is a 1st party or 2nd party support agreement in place. Some vendors require a 2nd party agreement directly with the client, while others allow you to resell their components (requiring you to broker support). Either way, anticipate your users / clients having a platform issue that you’ve never seen, in a critical situation, in the middle of the night, and be prepared to deal with it!
Keep technology components and vendor support contracts current, to ensure that you can provide critical support to your customers.
9. Mistake #9: No community support
Community support means vetting new features and functions with your user base, in advance of general release.
- Foist. Foisting means you thrust a new feature or function upon your user base without their approval. Users generally don’t like this. Like… taking away the “Start” button.
- Public Betas. These help fix the “it worked on my machine” problem. Your user base has a wider variety of hardware and software components than you do, and a public beta can help identify and sort out errors much faster than lab testing, in a relatively controlled environment. End users are also a never-ending source of spontaneous, unanticipated input (take that however you want) that may find holes in your error detection and resilience approach.
- Market Demand. The opposite of foisting: allowing your beta users to suggest new features means staying ahead of market demand, making your product more marketable.
The purpose of community support is to make sure that development effort is in alignment with market demand, and to ensure that new features / functions are regression tested across a large user base.
10. Mistake #10: Production will never have problems.
Planning in advance for a site-wide disaster (disruptive event) ensures that data is current (Recovery Point Objective), and that the Disaster Recovery (DR) site can be brought up in a timely manner (Recovery Time Objective).
- Plan for a mirror Disaster Recovery (DR) site, with equal capacity and bandwidth. Often, the mistake made with DR is to use older equipment or have insufficient bandwidth in place – in the event of a disaster, DR becomes your production site, and it should be treated like production.
- Replicate transactions to a mirror server, where feasible. Anticipate that the connection to the Disaster Recovery (DR) server might be across a slower wide-area connection, and transactions might queue up. One approach is to use a 2nd local server to buffer the transactions.
- Have a master copy of all software components and installation keys set aside and copied to your DR site. There is nothing like trying to find a license key in the middle of a disaster.
- Plan for high availability at Production and DR.
- Servers can fail. They can have hardware problems, OS problems, or infrastructure problems outside of your control. If you have ONE server, and it fails, what is that down time going to cost you?
- Designing for active-active high availability means having multiple session-aware servers that all share the load.
- There should be enough servers in play, such that you can lose some predetermined part of your capacity, and maintain performance levels. If you have 2 active servers, losing 1 means 50% capacity reduction. Can your business run on 50% capacity?
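The “can your business run on 50% capacity?” question in the last bullet is just arithmetic, and worth scripting so capacity reviews use the same math every time. A sketch of the N+f sizing rule (the example numbers are illustrative):

```python
import math

def servers_needed(peak_load, per_server_capacity, tolerate_failures=1):
    """N+f sizing: enough servers to carry peak load even after
    `tolerate_failures` of them are lost."""
    base = math.ceil(peak_load / per_server_capacity)
    return base + tolerate_failures

def capacity_after_failure(total_servers, failed):
    """Fraction of capacity remaining when `failed` servers go down."""
    return (total_servers - failed) / total_servers
```

With two active servers, losing one leaves 50% capacity; with four, losing one leaves 75%, which is why the predetermined loss tolerance should drive the server count, not the other way around.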
Plan for upgrades:
- As the application footprint grows, making codebase updates in production becomes a greater concern.
- Plan for new codebase versions to be compatible with existing configuration and data. New features and updates can be enabled with a final configuration change, once all app servers are at the same codebase level.
- Use modular design, so that one module or group of functions can be updated independent of the rest of the application.
- Using high-availability, update a few app servers at a time to the new codebase during non-peak times.
- Consider performing a site-level upgrade in DR, and switching to DR for production while you update the “normally production” side.
11. Mistake #11: Keep Everything!
Have a data retention plan, and implement data purge routines.
- Database and file system growth can lead to performance issues.
- In addition to performance issues, keeping unnecessary data could put you or your customers at risk, if there is a data breach.
- Every transactional data table or file system should have a purge plan.
- Purge scripts can be driven at the app, database, or OS level, but should be configurable within the application.
- From a capacity planning standpoint, make sure you have database and file storage calculators to help administrators figure out what resources will be used by the application.
- I typically try to provide a sizing spreadsheet, where the client can plug in some variables (what types of transactions they will make, and how many) to predict storage and network capacity requirements accordingly.
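The core of such a sizing spreadsheet is one formula. Here is a sketch; the 30% overhead factor for indexes, logs, and slack is an assumption you would tune per application, not a figure from the article:

```python
def storage_forecast_gb(transactions_per_day, bytes_per_transaction,
                        retention_days, overhead_factor=1.3):
    """Sizing-spreadsheet math: daily volume times retention, padded
    for indexes and logs. overhead_factor=1.3 is an assumed 30% pad."""
    raw = transactions_per_day * bytes_per_transaction * retention_days
    return raw * overhead_factor / 1_000_000_000
```

For example, 10,000 transactions a day at 2 KB each, retained for 90 days, is 1.8 GB of raw data before overhead; the administrator can plug in their own numbers and plan disk ahead of the purge schedule.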
12. Mistake #12: Reporting is an afterthought.
Reports can cripple your application.
Trying to run reports on a highly-transactional database means lots of locking. One misplaced, long-running report can cause enough locking to prevent new transactions from entering the database!
- Plan to integrate with a BI / reporting tool – create schemas and views that are easy for semi-skilled report-writers (analysts) to read.
- Have controls in place to prevent locking key transactional tables. Views are a great way to prevent this — views can be configured with the appropriate isolation level.
- Choosing an appropriate isolation level means that your query can accept data that may be slightly out of date (known as a “dirty read” or “uncommitted read”). If you have a long, scary report that runs for 4 hours at the default isolation level, it will try to lock resources in various tables in order to maintain a “consistent picture” from a transaction standpoint. By using a less aggressive locking strategy, you can improve performance and reduce impact on waiting transactions.
- Plan in advance for a separate report repository. Ensuring that your reporting engine can run from a read-only copy of the database means that all day-2 reporting can be run from a replicated copy, with no impact on the transactional primary.
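The read-only-repository idea can be demonstrated at small scale. In this sketch, a SQLite file stands in for the replicated reporting copy (the real system would use your database’s replication and isolation features); the key point is that the reporting connection is opened read-only, so even a misbehaving report cannot take write locks or modify transactional data:

```python
import os
import sqlite3
import tempfile

# Hypothetical "replica" database file with some transactional data.
db_path = os.path.join(tempfile.mkdtemp(), "orders.db")
writer = sqlite3.connect(db_path)
writer.execute("CREATE TABLE orders (id INTEGER, total REAL)")
writer.executemany("INSERT INTO orders VALUES (?, ?)",
                   [(1, 10.0), (2, 20.0)])
writer.commit()

# Reporting connection opened read-only via a URI: reads work...
reporter = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
report_total = reporter.execute("SELECT SUM(total) FROM orders").fetchone()[0]

# ...but any attempted write from the reporting side fails cleanly.
try:
    reporter.execute("DELETE FROM orders")
    write_blocked = False
except sqlite3.OperationalError:
    write_blocked = True
```

Pointing analysts and BI tools at a connection like this is the cheapest insurance against a 4-hour report locking up the primary.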
13. Summary
- Do clean-build testing to ensure that your application works everywhere, not just on your dev system.
- Understand licensing requirements. Obtain the proper licenses where applicable, or make sure you bundle 3rd-party licenses if needed.
- Assume every transaction might fail. Make sure your app never commits a partial transaction. Allow your app to support multiple connection points, and have it route around failed components. Allow for timeouts and retries, assuming your network or environment might be slow or flaky.
- Plan for scaling from day 1. Figure out how each app tier might be serviced by multiple instances, and how they will communicate. If your platform or OS has inherent limitations, figure out how your app will use system resources more efficiently on larger systems.
- Understand compliance issues based on the data your app receives, stores, processes, or transmits. Understand and work with infrastructure security mechanisms. Beware query by form.
- Avoid platform bloat. Use only the features, components and objects that you really need. Investigate smaller platforms that might be purpose-specific, but suited to your need.
- Build robust error handling, lots of logging, and relevant diagnostics directly in to the application.
- Maintain vendor support contracts.
- Deliver features and functionality your user community wants, and give them an opportunity to test it.
- Plan in advance for high availability (HA) and disaster recovery (DR).
- Have a built-in retention policy and data purge / cleanup mechanism.
- Plan in advance for reporting.