当前位置: 首页 > news >正文

网站详情页用哪个软件做it学校培训学校哪个好

网站详情页用哪个软件做,it学校培训学校哪个好,广告型网站建设,南京医院网站建设方案返回的是文档解析分段内容组成的列表,分段内容默认chunk_size: int 250, chunk_overlap: int 50,250字分段,50分段处保留后面一段的前50字拼接即窗口包含下下一段前面50个字划分 from typing import Union, Listimport jieba import recla…

返回的是文档解析分段内容组成的列表,分段内容默认chunk_size: int = 250, chunk_overlap: int = 50,250字分段,50分段处保留后面一段的前50字拼接即窗口包含下下一段前面50个字划分

from typing import Union, Listimport jieba
import reclass SentenceSplitter:def __init__(self, chunk_size: int = 250, chunk_overlap: int = 50):self.chunk_size = chunk_sizeself.chunk_overlap = chunk_overlapdef split_text(self, text: str) -> List[str]:if self._is_has_chinese(text):return self._split_chinese_text(text)else:return self._split_english_text(text)def _split_chinese_text(self, text: str) -> List[str]:sentence_endings = {'\n', '。', '!', '?', ';', '…'}  # 句末标点符号chunks, current_chunk = [], ''for word in jieba.cut(text):if len(current_chunk) + len(word) > self.chunk_size:chunks.append(current_chunk.strip())current_chunk = wordelse:current_chunk += wordif word[-1] in sentence_endings and len(current_chunk) > self.chunk_size - self.chunk_overlap:chunks.append(current_chunk.strip())current_chunk = ''if current_chunk:chunks.append(current_chunk.strip())if self.chunk_overlap > 0 and len(chunks) > 1:chunks = self._handle_overlap(chunks)return chunksdef _split_english_text(self, text: str) -> List[str]:# 使用正则表达式按句子分割英文文本sentences = re.split(r'(?<=[.!?])\s+', text.replace('\n', ' '))chunks, current_chunk = [], ''for sentence in sentences:if len(current_chunk) + len(sentence) <= self.chunk_size or not current_chunk:current_chunk += (' ' if current_chunk else '') + sentenceelse:chunks.append(current_chunk)current_chunk = sentenceif current_chunk:  # Add the last chunkchunks.append(current_chunk)if self.chunk_overlap > 0 and len(chunks) > 1:chunks = self._handle_overlap(chunks)return chunksdef _is_has_chinese(self, text: str) -> bool:# check if contains chinese charactersif any("\u4e00" <= ch <= "\u9fff" for ch in text):return Trueelse:return Falsedef _handle_overlap(self, chunks: List[str]) -> List[str]:# 处理块间重叠overlapped_chunks = []for i in range(len(chunks) - 1):chunk = chunks[i] + ' ' + chunks[i + 1][:self.chunk_overlap]overlapped_chunks.append(chunk.strip())overlapped_chunks.append(chunks[-1])return overlapped_chunkstext_splitter = SentenceSplitter()def load_file(filepath):print("filepath:",filepath)if filepath.endswith(".md"):contents = extract_text_from_markdown(filepath)elif filepath.endswith(".pdf"):contents = extract_text_from_pdf(filepath)elif filepath.endswith('.docx'):contents = extract_text_from_docx(filepath)else:contents = extract_text_from_txt(filepath)return contentsdef extract_text_from_pdf(file_path: str):"""Extract text content from a PDF file."""import PyPDF2contents = []with open(file_path, 'rb') as f:pdf_reader = PyPDF2.PdfReader(f)for page in pdf_reader.pages:page_text = page.extract_text().strip()raw_text = [text.strip() for text in page_text.splitlines() if text.strip()]new_text = ''for text in raw_text:new_text += textif text[-1] in ['.', '!', '?', '。', '!', '?', '…', ';', ';', ':', ':', '”', '’', ')', '】', '》', '」','』', '〕', '〉', '》', '〗', '〞', '〟', '»', '"', "'", ')', ']', '}']:contents.append(new_text)new_text = ''if new_text:contents.append(new_text)return contentsdef extract_text_from_txt(file_path: str):"""Extract text content from a TXT file."""with open(file_path, 'r', encoding='utf-8') as f:contents = [text.strip() for text in f.readlines() if text.strip()]return contentsdef extract_text_from_docx(file_path: str):"""Extract text content from a DOCX file."""import docxdocument = docx.Document(file_path)contents = [paragraph.text.strip() for paragraph in document.paragraphs if paragraph.text.strip()]return contentsdef extract_text_from_markdown(file_path: str):"""Extract text content from a Markdown file."""import markdownfrom bs4 import BeautifulSoupwith open(file_path, 'r', encoding='utf-8') as f:markdown_text = f.read()html = markdown.markdown(markdown_text)soup = BeautifulSoup(html, 'html.parser')contents = [text.strip() for text in soup.get_text().splitlines() if text.strip()]return contentstexts = load_file(r"C:\Users\lo***山市城市建筑外立面管理条例.docx")
print(texts)

在这里插入图片描述

http://www.khdw.cn/news/54514.html

相关文章:

  • 网站建设公司的公司哪家好seo查询系统
  • 织梦网站tel标签app推广方案范例
  • 福建网站建设模板百度6大核心部门
  • 江苏有什么网站找工程建设人员介绍网络营销的短文
  • 企业cms建站会计培训机构
  • html5做网站链接电工培训技术学校
  • 为什么建手机网站设计公司排名
  • 怎么在记事本上做网站奉节县关键词seo排名优化
  • 网站被挂黑链怎么办营销型网站的分类不包含
  • 做网站图片太大好吗seo外包公司费用
  • 网站推广 网站yahoo搜索引擎入口
  • 微信做淘宝客网站百度统计工具
  • 网站核验单下载百度资讯指数
  • 静态网页模板免费网站世界羽联最新排名
  • 云开发网站贵阳网站建设制作
  • 租号网站怎么做的百度热词指数
  • 连云港网站关键字优化市场写软文的平台有哪些
  • 党建网站设计引流推广多少钱一个
  • 网站首页动画效果北京最新疫情情况
  • 网站上传的图片怎么做的清晰一个新品牌怎样营销推广
  • 帮人做网站赚钱吗我是站长网
  • 怎么做万网网站吗网店营销
  • 酒店网站策划书河北seo诊断培训
  • 做网站是需要多少钱推广软件下载
  • 网站设计教程宁波seo网络推广选哪家
  • 奶茶加盟网站建设百度推广怎么优化关键词的质量
  • 长春火车站咨询电话关键词林俊杰mp3
  • 零基础wordpressseo排名优化培训
  • 做游戏网站年入百万百度地图导航2022最新版下载
  • 为什么用MyEclipse做网站宁波seo公司