a9410738aefix permissions bug when creating directories, need to put in user/group execute bit.
Matt
2015-10-07 08:26:27 -06:00
4600ce0816fix threads from freezing up just because pthread_create() had an error. need to return the thread stack.
Matt
2015-10-07 07:50:44 -06:00
836aa1756dfix threads from freezing up just because pthread_create_thread() had errors. need to return the thread stack to the linked list of thread stacks.
Matt
2015-10-07 07:49:34 -06:00
6d90ea2e5ftry to launch threads even if none need cleanup. hopefully fixes thread freeze.
Matt
2015-10-07 07:29:37 -06:00
cee5d8922aMerge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing
Matt
2015-10-05 18:48:06 -06:00
5b605624cafix core dump
Matt
2015-10-05 18:47:11 -06:00
a77c9be5b8Merge branch 'diffbot-testing' into diffbot-sam
Matt
2015-10-05 17:32:21 -06:00
df1c7f6e0fupdate qa.cpp syntax test to do &n=100 for gbssStatusCode:0 query
Matt
2015-10-05 17:31:35 -06:00
97b9c99beccorrecting facet min=0
sam
2015-10-05 16:09:52 -07:00
49ec9c99fdDon't restart all items when forcing a list of items into injector.
Zak Betz
2015-10-05 15:33:09 -06:00
9b785a1522allow more than 2gb of mem to be allocated to hold resulting docids.
Matt
2015-10-05 09:35:44 -07:00
757a44b149fix facets when doing > 1 split and first split termlist is empty.
Matt
2015-10-05 10:05:08 -06:00
21b71226a6remove bad fix
Matt
2015-10-05 09:33:02 -06:00
c947252feeAdd gbcapturedate to individual doc's metadata when injecting warcs.
Zak Betz
2015-10-04 01:53:54 -06:00
39214a9dc6Merge branch 'diffbot-testing' into testing
Matt
2015-10-02 19:26:15 -06:00
42cdd5b382fix msg20 getsummary core
Matt Wells
2015-10-02 12:34:08 -07:00
e4adc99c0cfix empty winner tree bug. try to improve rdbcache promotion logic for all caches. -O2 on spider.cpp.
Matt Wells
2015-10-02 12:16:48 -07:00
9daaa4d5afMerge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing
Matt Wells
2015-10-01 19:37:12 -07:00
9178d67b2ffix churn bug in winnerlistcache in spider.cpp so do not add the dolebuf list of spiderrequests back into the cache, but just modify the "jump" in the first 4 bytes of the cached record. because when we re-added it back to the cache it created too much churn and we'd lose cached records unnecessarily.
Matt Wells
2015-10-01 19:35:34 -07:00
6becb55a2bStream warcs instead of downloading them and unzipping them on disk.
Zak Betz
2015-09-30 22:25:59 -06:00
a8e3e4b269if metadata is already in the old xmldoc::ptr_metadata then do not re-add it.
Matt
2015-09-30 21:46:07 -06:00
e06ae06c23Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing
Matt
2015-09-30 15:27:07 -06:00
a31c7f5fc8exiting msg
Matt
2015-09-30 15:26:58 -06:00
06aea41611show spiderdb scan progress in spider queue for the collection
Matt Wells
2015-09-30 13:38:22 -07:00
b97546f98cdo not expand about:blank iframes.
Matt Wells
2015-09-30 09:36:04 -07:00
cb4bbe8892Merge branch 'ia' into ia-zak
Matt
2015-09-30 07:58:31 -06:00
d4c677170findex metadata on EDOCUNCHANGED errors, and append new meta data to XmlDoc::ptr_metadata.
Matt
2015-09-30 07:57:40 -06:00
67fc339953prevent out of mem core. actually trying to alloc more than 2GB for search result stuff.
Matt
2015-09-26 21:34:07 -07:00
f0a2f86200Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing
Matt
2015-09-25 08:09:25 -07:00
55993e58d5fix cores on gi #0
Matt
2015-09-25 08:09:05 -07:00
2721256c0dshow ip port of bad host
Matt Wells
2015-09-25 07:47:21 -07:00
68a3679faeexit if cannot load auth/internetarchive.yml file
Matt
2015-09-25 08:33:25 -06:00
83ac18fff4Merge branch 'master' into testing
Matt
2015-09-25 08:25:19 -06:00
e2fad81227Merge branch 'testing' of github.com:gigablast/open-source-search-engine into testing
Matt
2015-09-25 08:24:54 -06:00
3ce6c7d941Merge branch 'ia-zak' into testing
Matt
2015-09-25 08:24:12 -06:00
1e8f656d30Merge branch 'diffbot-testing' into ia-zak
Matt
2015-09-25 08:23:42 -06:00
93943a0cabsome pages legitimately have no outlinks, no need to think they were banned.
Matt Wells
2015-09-24 14:01:23 -07:00
6454dad6bfRevert "ignore real root, just use seeds, to detect if banned."
Matt Wells
2015-09-24 13:29:20 -07:00
cb60f68e72ignore real root, just use seeds, to detect if banned.
Matt
2015-09-24 14:15:29 -06:00
260864b364urgent fix for core dumps for some queries that have long termlists.
Matt
2015-09-24 11:49:50 -06:00
268b21d552reduce log spam
Matt Wells
2015-09-24 11:37:16 -06:00
9be3f9310efix annoying core dump for some queries in Posdb.cpp
Matt
2015-09-24 11:34:02 -06:00
8a0461b82fMerge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing
Matt
2015-09-24 09:10:37 -06:00
d92b153090added 'verify writes' switch to track down data corruption
Matt
2015-09-24 09:10:20 -06:00
100888d691fix file/dir creation permissions bugs
Matt
2015-09-21 12:44:41 -06:00
74cde33a3ajust use the user's umask val for all file/dir creation
Matt
2015-09-21 11:33:38 -06:00
ce7b06fc4dall files made are now group writable. if you don't like that then you can make a special group and set the directory just group writable for that group using chmod g+s <dir>.
Matt
2015-09-21 11:19:34 -06:00
eefbe95ce9Merge branch 'diffbot-testing' into ia-zak
Matt
2015-09-21 10:13:29 -06:00
786ba76d10Merge branch 'ia-zak' into testing
Matt
2015-09-21 10:12:58 -06:00
5aaa08d81aMerge branch 'ia-zak' of github.com:gigablast/open-source-search-engine into ia-zak
Matt
2015-09-21 10:12:38 -06:00
5635695666oom prevention
Matt
2015-09-20 21:42:46 -06:00
d6d5d10a15prevent core from bad root title rec
Matt Wells
2015-09-20 08:26:00 -07:00
69a3cb0999fix corrupt tag with corrupt root title buf from coring
Matt Wells
2015-09-17 21:33:58 -07:00
13e0ba7bfffix bug of having a meta redirect tag in <script> tags. we have to use Xml class to make sure it is a legit refresh tag.
Matt
2015-09-16 11:03:38 -06:00
58e9f56015never let any diffbot error prevent us from retrying a url in subsequent crawl rounds.
Matt
2015-09-16 10:00:11 -06:00
bcdecc63c6expose "urlip" injection parm to provide ip of url being injected to save gigablast from an ip lookup if you want.
Matt
2015-09-16 09:43:15 -06:00
d11761cfd9update graph key
Matt
2015-09-15 15:48:40 -06:00
b2f7e72d8afix core from showing graph to two users
Matt
2015-09-15 15:46:38 -06:00
32d7f5cb97better warc injection load balancing
Matt
2015-09-15 15:04:26 -06:00
28ed7f66afMerge branch 'diffbot-testing' into ia-zak
Matt
2015-09-14 19:16:20 -06:00
f9c4f8fc9atest with 2 first
Matt
2015-09-14 19:16:07 -06:00
fcd4fe3ff3Merge branch 'diffbot-testing' into ia-zak
Matt
2015-09-14 19:04:15 -06:00
16d90e60d1allow up to 10 filter threads now
Matt
2015-09-14 19:03:52 -06:00
5caa219c71Reduce false positives by not counting \0 as a non-ascii char in the url.
Zak Betz
2015-09-14 12:24:50 -06:00
5d724cdcc3Check for spaces before non-ascii chars to reduce false positives. Also print the position of non-ascii char to aid debugging. We still need to handle utf8 chars in path.
Zak Betz
2015-09-14 11:11:56 -06:00
519b2c4f42Fix repeating xn--xn-- when there are spaces in the domain. Make gb unittest take a name of the unit test to run.
Zak Betz
2015-09-14 10:24:22 -06:00
519017828cEnable punycode domains for testing. We still need to display them as utf8 on the front end.
Zak Betz
2015-09-14 09:32:25 -06:00
5622ca47eeWork on non-ascii domain names. It works on correct inputs, but will crash on some incorrect inputs, so it is forced to be disabled.
Zak Betz
2015-09-14 00:34:44 -06:00
b4bac67fdfMerge branch 'diffbot-testing' into testing
Matt
2015-09-13 19:40:57 -06:00
710661a0f3Merge branch 'ia-zak' into ia
Matt
2015-09-13 19:40:21 -06:00
ffa465b942Merge branch 'ia' of github.com:gigablast/open-source-search-engine into ia
Matt
2015-09-13 19:40:17 -06:00
f1db3aca94Merge branch 'diffbot-testing' into ia-zak
Matt
2015-09-13 19:39:50 -06:00
b8e4046d61remove unnecessary line
Matt Wells
2015-09-13 17:54:38 -07:00
65613feb4cfix bug of not using part files when generating map
Matt
2015-09-13 17:52:40 -07:00
cb6ca24c26Allow nospider and noquery on the same host. Fix punycoding of non-ascii domains.
Zak Betz
2015-09-13 17:15:31 -06:00
3444c67851exit faster.
Matt
2015-09-13 14:28:35 -07:00
6370054bd5fix problem of adding too many collections and not wrapping the collnum_t id
Matt Wells
2015-09-13 14:21:52 -07:00