Error ID 1 - Wrong File Type

Modified on Fri, 3 May, 2019 at 1:36 PM

Error Description

You received this error for one of two reasons:

You uploaded a file in an unrecognized format.
You tried to submit an analysis with a sample containing a file type that is not appropriate for the workflow.

Solutions

Solution 1

If you got the file type error when trying to start an analysis, double-check the contents of your files to verify what type of data they contain. The analysis error message should indicate which file types the workflow accepts. See "Further information" below for the formats of common file types used by Basepair.

Solution 2

If you are sure your files contain the right type of data, rename them with an appropriate file suffix and re-upload them. See "Further information" below for accepted suffixes for common file types used by Basepair.

Solution 3 - contact us

If you still cannot resolve your issue, please don't hesitate to contact us in either of these ways:

Create a ticket here: http://support.basepairtech.com/support/tickets/new
Message us using the chat icon in the lower right corner of your screen on the Basepair dashboard.

Further information

Basepair deals with various types of files. The type of file is usually indicated by the suffix (e.g. ".csv" for a comma-separated values file). Because the files we deal with tend to be large, they are usually also compressed in some way. Here are some common suffixes that indicate a file is compressed:

.gz
.zip
.tar.gz
.bz2
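
A suffix alone does not prove that a file is compressed. As a minimal sketch, the first few bytes of a file can be compared against the well-known signatures of these formats. The helper name `sniff_compression` is hypothetical and is not part of any Basepair tool.

```python
# Leading "magic" bytes of common compressed formats.
MAGIC_BYTES = {
    b"\x1f\x8b": "gzip",    # .gz and .tar.gz
    b"PK\x03\x04": "zip",   # .zip (an archive with at least one entry)
    b"BZh": "bzip2",        # .bz2
}


def sniff_compression(path):
    """Return the compression format suggested by the first bytes, or None."""
    with open(path, "rb") as handle:
        head = handle.read(4)
    for magic, name in MAGIC_BYTES.items():
        if head.startswith(magic):
            return name
    return None
```

If the function returns None for a file named something like "reads.fastq.gz", the file is probably not compressed and the suffix is misleading.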

Unfortunately, we frequently see cases where the file suffix does not match the actual file type, or where the suffix indicates that the file is compressed when it is not. Below we go over the common file types Basepair deals with:

FASTQ

This file type contains sequence reads and usually looks like this:

@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65

There are four lines for each read:

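Line 1: The name of the read, prefixed with "@"
Line 2: The actual read content in A's, T's, C's, and G's (an N marks a base that could not be called)
Line 3: A "+" separator, optionally followed by the read name again
Line 4: The per-base quality scores, one character for each base in line 2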
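
The four-line layout can be checked with a few lines of code. This is a minimal sketch, not the validation Basepair itself performs; the function name `check_fastq_record` is hypothetical. It relies on the facts that line 1 starts with "@", line 3 starts with "+", and lines 2 and 4 have the same length.

```python
def check_fastq_record(lines):
    """Return a list of problems found in one four-line FASTQ record."""
    if len(lines) != 4:
        return ["expected 4 lines, got %d" % len(lines)]
    name, sequence, separator, quality = (line.rstrip("\n") for line in lines)
    problems = []
    if not name.startswith("@"):
        problems.append("line 1 does not start with '@'")
    if not separator.startswith("+"):
        problems.append("line 3 does not start with '+'")
    if len(sequence) != len(quality):
        problems.append("sequence and quality lengths differ")
    return problems


record = [
    "@SEQ_ID",
    "GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT",
    "+",
    "!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65",
]
print(check_fastq_record(record))  # an empty list means no problems were found
```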

You can read more about FASTQ file format here: https://en.wikipedia.org/wiki/FASTQ_format

Common file suffixes for FASTQ files are:

.fastq
.fq
And compressed variations like: .fq.gz, .fq.zip, .fastq.gz, etc.

The following are also sometimes used:

.txt (and its compressed variations)
.sra (which is a special compressed format from the Sequence Read Archive)

You can open FASTQ files in a text editor to check their contents, but be warned that FASTQ files are often very large.
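
For a very large file it is usually enough to look at the first few lines rather than open the whole file. The sketch below reads only the start of a file and handles both plain and gzip-compressed text. The function name `head` is an illustrative choice, and other compression formats (.zip, .bz2) are not handled here.

```python
import gzip
from itertools import islice


def head(path, count=8):
    """Return the first `count` lines of a plain or gzip-compressed text file."""
    # Decide by content, not by suffix: gzip files start with bytes 1f 8b.
    with open(path, "rb") as handle:
        is_gzip = handle.read(2) == b"\x1f\x8b"
    opener = gzip.open if is_gzip else open
    with opener(path, "rt", encoding="utf-8", errors="replace") as handle:
        return [line.rstrip("\n") for line in islice(handle, count)]
```

Eight lines correspond to two FASTQ reads, which is normally enough to see whether the file really follows the four-line layout.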

FASTA

This file type contains DNA, RNA, or protein sequences. Reference genomes and transcriptomes are some examples of data stored in FASTA files. An example is below:

>Sequence 1
ATCGATCGTATTTCTCTTTAAACGGTATGCTA
>Sequence 2
TTTTGCGCGTTCTTAGGCTTGCTATCTC

There are two parts to each sequence:

Line 1: The sequence name prefixed with a ">"
Line 2: The actual sequence (whether it be DNA, RNA, or protein), which for long sequences is often wrapped across several lines
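
The layout can be illustrated with a small reader. This is a minimal sketch with a hypothetical function name, `read_fasta`. Because long FASTA sequences are often wrapped over several lines, it joins every line that follows a ">" line until the next ">" line.

```python
def read_fasta(lines):
    """Return a dict mapping each sequence name to its sequence."""
    records = {}
    name = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            name = line[1:]
            records[name] = ""
        elif name is None:
            raise ValueError("sequence data found before the first '>' line")
        else:
            records[name] += line  # join wrapped sequence lines
    return records


example = [
    ">Sequence 1",
    "ATCGATCGTATTTCTCTTTAAACGGTATGCTA",
    ">Sequence 2",
    "TTTTGCGCGTTCTTAGGCTTGCTATCTC",
]
print(read_fasta(example))
```

A file that raises the ValueError here does not begin with a ">" line and is probably not FASTA.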

You can read more about the FASTA format here: https://en.wikipedia.org/wiki/FASTA

Here are some common file suffixes for FASTA files:

.fasta
.fa
And compressed variations like: .fasta.gz, .fa.zip, etc.

You can open FASTA files in a text editor to check their contents.

BAM/SAM

SAM stands for Sequence Alignment/Map, and BAM is its compressed binary form. Both store reads aligned to a reference sequence. Two example lines are shown below:

GWNJ-0965:327:GW1811231642:8:1103:14407:23987  163  KmetStat_WT_A  1  1  112M  =  1  -112  GAAACGCGTA  AAFFFJJJJJ  AS:i:0  XS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:112  YS:i:0  YT:Z:CP
GWNJ-0965:327:GW1811231642:8:1103:10287:72367  99  KmetStat_WT_A  1  34  135M  =  1  -135  GAAACGCGTA  AAFFFJJJJJ  AS:i:0  XS:i:-35  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:135  YS:i:0  YT:Z:CP

Each line contains information for one sequence read, such as its mapping quality, mate pair information, etc. You can read more about the BAM/SAM format here: https://samtools.github.io/hts-specs/SAMv1.pdf
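
To make the columns concrete, here is a minimal sketch that splits one SAM alignment line into the eleven mandatory fields named in the SAM specification. In a real SAM file the columns are separated by tabs. The function name `parse_sam_line` is hypothetical, and header lines (which start with "@") are skipped rather than parsed.

```python
SAM_FIELDS = ("QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL")
INTEGER_FIELDS = ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN")


def parse_sam_line(line):
    """Return a dict of the mandatory SAM fields, or None for a header line."""
    if line.startswith("@"):
        return None
    columns = line.rstrip("\n").split("\t")
    if len(columns) < len(SAM_FIELDS):
        raise ValueError("expected at least 11 tab-separated columns, got %d"
                         % len(columns))
    record = dict(zip(SAM_FIELDS, columns))
    for key in INTEGER_FIELDS:
        record[key] = int(record[key])
    # Anything after the 11th column is an optional TAG:TYPE:VALUE field.
    record["TAGS"] = columns[len(SAM_FIELDS):]
    return record
```

A line that raises the ValueError, or whose numeric columns cannot be converted, is a sign that the file is not really SAM.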

Here are some common file suffixes for BAM/SAM files:

.bam for BAM file
.sam for SAM file

You can open SAM files in a text editor, but be warned that these files are often several to tens of gigabytes in size. To check the contents of a BAM file you need specialized software (like samtools). You can also try opening it in the Integrative Genomics Viewer (IGV).
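
When specialized software is not at hand, a rough check is still possible. A BAM file is stored in a gzip-compatible container, and its decompressed content starts with the four bytes "BAM" followed by the byte 1. The sketch below tests only that signature. It does not validate the rest of the file, and the function name `looks_like_bam` is hypothetical.

```python
import gzip


def looks_like_bam(path):
    """Return True if the file decompresses as gzip and starts with the BAM signature."""
    try:
        with gzip.open(path, "rb") as handle:
            return handle.read(4) == b"BAM\x01"
    except (OSError, EOFError):
        # Not gzip-compatible (for example a plain-text SAM file) or truncated.
        return False
```

A file named ".bam" for which this returns False is most likely a SAM file or some other format with the wrong suffix.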

VCF

This file type contains variants called for one or more samples (e.g. SNPs, indels, somatic mutations). This is a more complicated file format and (unfortunately) different tools produce slightly different variations on it. An example is below:

##fileformat=VCFv4.3
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta
##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS      ID         REF   ALT    QUAL  FILTER   INFO                             FORMAT       NA00001         NA00002          NA00003
20     14370    rs6054257  G     A      29    PASS    NS=3;DP=14;AF=0.5;DB;H2           GT:GQ:DP:HQ  0|0:48:1:51,51  1|0:48:8:51,51   1/1:43:5:.,.
20     17330    .          T     A      3     q10     NS=3;DP=11;AF=0.017               GT:GQ:DP:HQ  0|0:49:3:58,50  0|1:3:5:65,3     0/0:41:3
20     1110696  rs6040355  A     G,T    67    PASS    NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ  1|2:21:6:23,27  2|1:2:0:18,2     2/2:35:4
20     1230237  .          T     .      47    PASS    NS=3;DP=13;AA=T                   GT:GQ:DP:HQ  0|0:54:7:56,60  0|0:48:4:51,51   0/0:61:2
20     1234567  microsat1  GTC   G,GTCT 50    PASS    NS=3;DP=9;AA=G                    GT:GQ:DP     0/1:35:4        0/2:17:2         1/1:40:3

You can read more about the VCF format here: https://en.wikipedia.org/wiki/Variant_Call_Format
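
The overall layout of a VCF file is simple even though the details vary: meta-information lines start with "##", a single column header line starts with "#CHROM", and every following line is one variant record. The sketch below summarizes a file along those lines. It assumes tab-separated columns, as in real VCF files, and the function name `summarize_vcf` is hypothetical.

```python
def summarize_vcf(lines):
    """Return (file format, sample names, number of variant records)."""
    file_format = None
    samples = []
    records = 0
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("##fileformat="):
            file_format = line.split("=", 1)[1]
        elif line.startswith("##"):
            continue  # other meta-information lines
        elif line.startswith("#CHROM"):
            # The first 9 columns are fixed; sample names follow them.
            samples = line.split("\t")[9:]
        elif line:
            records += 1
    return file_format, samples, records
```

If the file format comes back as None, the file is missing the "##fileformat" line that a VCF file is expected to start with.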

Here are some common file suffixes for VCF files:

.vcf
And compressed variations like: .vcf.gz, .vcf.zip, etc.

You can open VCF files in a text editor or even Excel to view their contents.