Detecting and analyzing transient astronomical sources is essential for understanding their time evolution. Traditional pipelines identify transient sources from difference (D) images, obtained by subtracting previously observed reference (R) images from new science (N) images, a process that involves extensive manual inspection. In this study, we present TransientViT, a hybrid convolutional neural network (CNN) and vision transformer (ViT) model that differentiates between transients and image artifacts for the Kilodegree Automatic Transient Survey (KATS). TransientViT uses CNNs to reduce the image resolution and a hierarchical attention mechanism to model features globally. We introduce the KATS-T 200K dataset, which combines the difference images with both long- and short-term images to provide a temporally continuous, multidimensional dataset. With this dataset as input, TransientViT outperformed other transformer- and CNN-based models, achieving an overall area under the curve (AUC) of 0.97 and an accuracy of 99.44%. Ablation studies demonstrated the impact of different input channels, multi-input fusion methods, and cross-inference strategies on model performance. As a final step, a voting-based ensemble that combines the inference results of three NRD images further improved the reliability and robustness of the model's predictions. This hybrid model can serve as a reference for future studies on real/bogus transient classification.
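As an illustration of two ideas named above, the following minimal Python sketch shows a pixel-wise difference image (D = N - R) and a simple majority vote over three per-image scores. It is a sketch under stated assumptions, not the TransientViT implementation: the function names `difference_image` and `majority_vote`, the 0.5 score threshold, and the majority rule itself are hypothetical choices, and a real pipeline would also align the images and match their point spread functions and flux scales before subtracting.

```python
from __future__ import annotations

from typing import Sequence


def difference_image(
    new: Sequence[Sequence[float]],
    ref: Sequence[Sequence[float]],
) -> list[list[float]]:
    """Pixel-wise D = N - R for two already aligned images of equal shape.

    Assumption: alignment, PSF matching, and flux scaling happened upstream.
    """
    if len(new) != len(ref) or any(len(n) != len(r) for n, r in zip(new, ref)):
        raise ValueError("N and R must have the same shape")
    return [
        [n_pix - r_pix for n_pix, r_pix in zip(n_row, r_row)]
        for n_row, r_row in zip(new, ref)
    ]


def majority_vote(scores: Sequence[float], threshold: float = 0.5) -> bool:
    """Return True ("real") if more than half of the scores exceed the threshold.

    Assumption: each score is a model's probability that the candidate is a
    real transient; the threshold and the majority rule are illustrative.
    """
    if not scores:
        raise ValueError("at least one score is required")
    votes = sum(score > threshold for score in scores)
    return 2 * votes > len(scores)


# Toy 2x2 images: only the top-left pixel brightened between R and N.
science = [[5.0, 3.0], [2.0, 2.0]]
reference = [[1.0, 3.0], [2.0, 2.0]]
diff = difference_image(science, reference)

# Hypothetical scores from three separate inferences on the same candidate.
is_real = majority_vote([0.9, 0.8, 0.2])
```

With three scores, the vote tolerates one dissenting inference, which is one simple way an ensemble can make a final real/bogus label less sensitive to a single noisy prediction.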